2303.05499
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
9 March 2023
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
Jie Yang
Chunyuan Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
Github (8136★)
Papers citing "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
50 / 690 papers shown
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Arsh Koneru
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VLM
147
5
0
15 Mar 2025
SPOC: Spatially-Progressing Object State Change Segmentation in Video
Priyanka Mandikal
Tushar Nagarajan
Alex Stoken
Zihui Xue
Kristen Grauman
79
0
0
15 Mar 2025
MTV-Inpaint: Multi-Task Long Video Inpainting
Shiyuan Yang
Zheng Gu
Liang Hou
Xin Tao
Pengfei Wan
Xiaodong Chen
Jing Liao
DiffM
77
2
0
14 Mar 2025
EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation
Qi Mao
Haobo Hu
Yujie He
Difei Gao
Haokun Chen
Libiao Jin
DiffM
85
0
0
14 Mar 2025
COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
Sanghyun Jo
Seo Jin Lee
Seungwoo Lee
Seohyung Hong
Hyungseok Seo
Kyungsu Kim
82
0
0
14 Mar 2025
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Chuhan Zhang
Chaoyang Zhu
Pingcheng Dong
Long Chen
Dong Zhang
ObjD
VLM
493
0
0
14 Mar 2025
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Weichen Zhang
Zile Zhou
Zhiheng Zheng
Chen Gao
Jinqiang Cui
Yongqian Li
Xinlei Chen
Xiao-Ping Zhang
LRM
137
5
0
14 Mar 2025
Attacking Multimodal OS Agents with Malicious Image Patches
Lukas Aichberger
Alasdair Paren
Y. Gal
Philip Torr
Adel Bibi
AAML
133
5
0
13 Mar 2025
Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Zhen Qu
Xian Tao
Xinyi Gong
Shichen Qu
Qiyu Chen
Zhengtao Zhang
Xingang Wang
Guiguang Ding
VLM
173
1
0
13 Mar 2025
Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction
Yuhan Wang
Cheng Liu
Daou Zhang
Weichao Wu
103
0
0
13 Mar 2025
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
Yiyang Ling
Karan Owalekar
Oluwatobiloba Adesanya
Erdem Bıyık
Daniel Seita
84
2
0
13 Mar 2025
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
Georgy Ponimatkin
Martin Cífka
Tomáš Souček
Médéric Fourmy
Yann Labbé
Vladimir Petrik
Josef Sivic
90
1
0
13 Mar 2025
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Yufan Deng
Xun Guo
Yanjie Wang
Jacob Zhiyuan Fang
Angtian Wang
Shenghai Yuan
Yiding Yang
Bo Liu
Haibin Huang
Chongyang Ma
DiffM
VGen
156
3
0
13 Mar 2025
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
Zixian Liu
Mingtong Zhang
Yunzhu Li
88
1
0
13 Mar 2025
PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Runze He
Bo Cheng
Yuhang Ma
Qingxiang Jia
Shanyuan Liu
Ao Ma
Xiaoyu Wu
Liebucha Wu
Dawei Leng
Yuhui Yin
DiffM
VLM
187
0
0
13 Mar 2025
A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection
Shenghao Fu
Junkai Yan
Q. Yang
Xihan Wei
Xiaohua Xie
Wei-Shi Zheng
ObjD
VLM
87
0
0
13 Mar 2025
OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
Jinyang Li
En Yu
Sijia Chen
Wenbing Tao
168
2
0
13 Mar 2025
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Peng Chen
Pi Bu
Yingyao Wang
Xinyi Wang
Xiangqi Jin
...
Qi Zhu
Jun Song
Siran Yang
Jiamang Wang
Bo Zheng
117
2
0
12 Mar 2025
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Zhihua Tian
Sirun Nan
Ming Xu
Shengfang Zhai
Wenjie Qu
Enchao Gong
Kui Ren
Ruoxi Jia
DiffM
147
2
0
12 Mar 2025
InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images
Jiun Tian Hoe
Weipeng Hu
Wei Zhou
Chao Xie
Ziwei Wang
Chee Seng Chan
Xudong Jiang
Y. Tan
121
0
0
12 Mar 2025
Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X
Patrick Knab
Marcel Kleinmann
Inken Adam
Kerstin Beckersjuergen
Andreas Edte
...
Timotheus Gumpp
Steffen Jung
Isaac Bravo
Stefanie Walter
Margret Keuper
72
0
0
12 Mar 2025
DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Chiara Cappellino
Gianluca Mancusi
Matteo Mosconi
Angelo Porrello
Simone Calderara
Rita Cucchiara
ObjD
VLM
189
0
0
12 Mar 2025
Online Language Splatting
Saimouli Katragadda
Cho-Ying Wu
Yuliang Guo
Xinyu Huang
Guoquan Huang
Liu Ren
3DGS
OffRL
116
0
0
12 Mar 2025
ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Tobias Christian Nauen
Brian B. Moser
Federico Raue
Stanislav Frolov
Andreas Dengel
ViT
187
0
0
12 Mar 2025
FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech
Yuzhi Lai
Shenghai Yuan
Boya Zhang
Benjamin Kiefer
Peizheng Li
Tianchen Deng
Andreas Zell
73
1
0
11 Mar 2025
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Weijie Zhou
Manli Tao
Chaoyang Zhao
Haiyun Guo
Honghui Dong
Ming Tang
Jinqiao Wang
112
2
0
11 Mar 2025
Referring to Any Person
Qing Jiang
Lin Wu
Zhaoyang Zeng
Tianhe Ren
Yuda Xiong
Yihao Chen
Qin Liu
Lei Zhang
507
2
0
11 Mar 2025
DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation
Sanghyun Jo
Ziseok Lee
Wooyeol Lee
Kyungsu Kim
138
2
0
11 Mar 2025
S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
Guangting Zheng
Jiajun Deng
Xiaomeng Chu
Yu Yuan
Houqiang Li
Yanyong Zhang
3DGS
107
0
0
11 Mar 2025
Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding
Tim Steinke
Martin Buchner
Niclas Vodisch
Abhinav Valada
97
1
0
11 Mar 2025
YOLOE: Real-Time Seeing Anything
Ao Wang
Lihao Liu
Hui Chen
Zijia Lin
Jiawei Han
Guiguang Ding
VLM
ObjD
136
6
0
10 Mar 2025
Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing
Xing Zi
Kairui Jin
Xian Tao
Jun Li
Ali Braytee
Rajiv Ratn Shah
Mukesh Prasad
VLM
91
0
0
10 Mar 2025
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang
Zhen Han
Chaojie Mao
Junxuan Zhang
Yulin Pan
Yu Liu
DiffM
VGen
141
23
0
10 Mar 2025
Multi-Modal 3D Mesh Reconstruction from Images and Text
Melvin Reka
Tessa Pulli
Markus Vincze
89
0
0
10 Mar 2025
PE3R: Perception-Efficient 3D Reconstruction
Jie Hu
Shizun Wang
Xinchao Wang
119
1
0
10 Mar 2025
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu
Sipeng Zheng
Börje F. Karlsson
Zongqing Lu
64
0
0
10 Mar 2025
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
Hanzhi Chen
Boyang Sun
Anran Zhang
Marc Pollefeys
Stefan Leutenegger
LM&Ro
164
0
0
10 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai
Luhao Zhu
Zhiqiang Chen
Ynan Ding
Yiying Dong
Xiaohong Liu
Guodong Guo
MLLM
ObjD
102
0
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
156
3
0
10 Mar 2025
Safety Guardrails for LLM-Enabled Robots
Zachary Ravichandran
Alexander Robey
Vijay Kumar
George Pappas
Hamed Hassani
126
5
0
10 Mar 2025
DreamRelation: Relation-Centric Video Customization
Yujie Wei
Shiwei Zhang
Hangjie Yuan
Biao Gong
Longxiang Tang
...
Haonan Qiu
Hengjia Li
Shuai Tan
Yize Zhang
Hongming Shan
VGen
134
1
0
10 Mar 2025
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Shijia Zhao
Qiming Xia
Xusheng Guo
Pufan Zou
Maoji Zheng
Hai Wu
Chenglu Wen
Cheng-Yu Wang
3DPC
135
0
0
09 Mar 2025
Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On
Roni Goldshmidt
MLLM
VLM
66
0
0
09 Mar 2025
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Adrian Chow
Evelien Riddell
Yimu Wang
Sean Sedwards
Krzysztof Czarnecki
3DPC
77
0
0
09 Mar 2025
Get In Video: Add Anything You Want to the Video
Shaobin Zhuang
Zhipeng Huang
Binxin Yang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Chong Sun
Zheng-Jun Zha
Chen Li
Yijiao Wang
DiffM
VGen
114
3
0
08 Mar 2025
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
127
5
0
08 Mar 2025
OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Ziyue Huang
Yongchao Feng
Shuai Yang
Ziqiang Liu
Qingjie Liu
Yansen Wang
ObjD
455
1
0
08 Mar 2025
FloPE: Flower Pose Estimation for Precision Pollination
Rashik Shrestha
Madhav Rijal
T. Smith
Yu Gu
101
0
0
08 Mar 2025
From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning
Shuangzhi Li
Junlong Shen
Lei Ma
Xingyu Li
3DPC
113
0
0
08 Mar 2025
Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting
Dominic Maggio
Luca Carlone
436
0
0
07 Mar 2025