2303.05499
v4 (latest)
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
9 March 2023
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
Jie Yang
Chunyuan Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
Github (8136★)
Papers citing
"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
50 / 691 papers shown
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
168
10
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
166
30
0
31 Dec 2024
VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
Zhipeng Chen
Lan Yang
Yonggang Qi
Honggang Zhang
Kaiyue Pang
Ke Li
Yi-Zhe Song
DiffM
204
0
0
31 Dec 2024
Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting
K. Gao
Liangzhi Li
Hongjie He
Dening Lu
Linlin Xu
Jonathan Li
GP
3DGS
102
2
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
193
42
0
31 Dec 2024
YOLO-UniOW: Efficient Universal Open-World Object Detection
Lihao Liu
Juexiao Feng
Hui Chen
Ao Wang
Lin Song
Jiawei Han
Guiguang Ding
ObjD
VLM
138
2
0
31 Dec 2024
AI-Powered Urban Transportation Digital Twin: Methods and Applications
Xuan Di
Yongjie Fu
Mehmet K. Turkcan
Mahshid Ghasemi
Zhaobin Mo
Chengbo Zang
Abhishek Adhikari
Z. Kostić
Gil Zussman
AI4CE
121
0
0
30 Dec 2024
Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
X. Feng
D. Zhang
Shuyan Hu
X. Li
M. Wu
Jie Zhang
Xiaojing Chen
K. Huang
93
1
0
27 Dec 2024
Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
Xiaoyang Liu
Boran Wen
Xinpeng Liu
Zizheng Zhou
Hongwei Fan
Cewu Lu
Lizhuang Ma
Yulong Chen
Yongqian Li
160
3
0
27 Dec 2024
Visual Prompting with Iterative Refinement for Design Critique Generation
Peitong Duan
Chin-Yi Cheng
Bjoern Hartmann
Yang Li
171
0
0
22 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro
LLMAG
178
33
0
20 Dec 2024
Towards a Training Free Approach for 3D Scene Editing
Vivek Madhavaram
Shivangana Rawat
Chaitanya Devaguptapu
Charu Sharma
Manohar Kaul
DiffM
140
0
0
17 Dec 2024
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman
Haiwen Feng
M. Black
Dimitrios Tzionas
Victoria Fernandez-Abrevaya
VGen
AI4CE
210
3
0
16 Dec 2024
ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction
Yi Feng
Yu Han
Xijing Zhang
Tanghui Li
Yanting Zhang
Rui Fan
304
3
0
15 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
548
1
0
12 Dec 2024
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
Keyi Shen
Jiangwei Yu
Huan Zhang
Yunzhu Li
180
1
0
12 Dec 2024
Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework
Jiuyi Xu
Meida Chen
Andrew Feng
Yangming Shi
Zifan Yu
100
0
0
09 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
216
8
0
05 Dec 2024
Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation
Xuanlin Li
Tong Zhao
Xinghao Zhu
Jiuguang Wang
Tao Pang
Kuan Fang
182
4
0
03 Dec 2024
Referring Video Object Segmentation via Language-aligned Track Selection
Seongchan Kim
Woojeong Jin
Sangbeom Lim
Heeji Yoon
Hyunwook Choi
Seungryong Kim
VOS
189
0
0
02 Dec 2024
HandOS: 3D Hand Reconstruction in One Stage
Xingyu Chen
Zhuheng Song
Xiaoke Jiang
Yaoqing Hu
Junzhi Yu
Lei Zhang
3DH
HAI
203
0
0
02 Dec 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
231
10
0
27 Nov 2024
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Zizhao Li
Zhengkang Xiang
Joseph West
Kourosh Khoshelham
ObjD
VLM
200
1
0
27 Nov 2024
RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
Raktim Gautam Goswami
Prashanth Krishnamurthy
Yann LeCun
Farshad Khorrami
169
1
0
26 Nov 2024
OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection
Zhongyu Xia
Jishuo Li
Zhiwei Lin
Xinhao Wang
Yansen Wang
Ming-Hsuan Yang
VLM
184
3
0
26 Nov 2024
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Claudia Cuttano
Gabriele Trivigno
Gabriele Rosi
Carlo Masone
Giuseppe Averta
VOS
211
3
0
26 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
175
2
0
25 Nov 2024
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation
Linqing Zhong
Chen Gao
Zihan Ding
Yue Liao
Si Liu
Shifeng Zhang
Xu Zhou
LRM
181
7
0
25 Nov 2024
Interpreting Object-level Foundation Models via Visual Precision Search
Ruoyu Chen
Siyuan Liang
Jingzhi Li
Shiming Liu
Maosen Li
Zheng Huang
Qichuan Geng
Xiaochun Cao
FAtt
239
5
0
25 Nov 2024
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Qifan Yu
Wei Chow
Zhongqi Yue
Kaihang Pan
Yang Wu
Xiaoyang Wan
Juncheng Billy Li
Siliang Tang
Hao Zhang
Yueting Zhuang
DiffM
244
29
0
24 Nov 2024
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
Guanyu Zhou
Xiaohan Yu
Wenxin Huang
Xuemei Jia
Xian Zhong
Chia-Wen Lin
CML
125
0
0
24 Nov 2024
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin
Jooyoung Choi
Heeseung Kim
Sungroh Yoon
DiffM
189
13
0
23 Nov 2024
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Junzhe Chen
Tianshu Zhang
Shijie Huang
Yuwei Niu
Linfeng Zhang
Lijie Wen
Xuming Hu
MLLM
VLM
510
6
0
22 Nov 2024
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Weijia Wu
Mingyu Liu
Zeyu Zhu
Xi Xia
Haoen Feng
Wen Wang
Kevin Qinghong Lin
Chunhua Shen
Mike Zheng Shou
DiffM
VGen
235
3
0
22 Nov 2024
VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing
Jiahao Hu
Tianxiong Zhong
Xuebo Wang
Boyuan Jiang
Xingye Tian
Fei Yang
Pengfei Wan
Di Zhang
VGen
126
3
0
22 Nov 2024
From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
Pengkun Jiao
Bin Zhu
Jingjing Chen
Chong-Wah Ngo
Yu-Gang Jiang
VLM
OffRL
170
0
0
19 Nov 2024
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Andong Deng
Tongjia Chen
Shoubin Yu
Taojiannan Yang
Lincoln Spencer
Yapeng Tian
Ajmal Mian
Joey Tianyi Zhou
Chen Chen
LRM
113
3
0
15 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
194
2
0
14 Nov 2024
Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints
Nishanth Kumar
F. Ramos
Dieter Fox
Caelan Reed Garrett
Tomás Lozano-Pérez
Leslie Pack Kaelbling
LRM
LM&Ro
148
5
0
13 Nov 2024
Semantic Enhancement for Object SLAM with Heterogeneous Multimodal Large Language Model Agents
Jungseok Hong
Ran Choi
John Leonard
VLM
156
1
0
11 Nov 2024
SuperQ-GRASP: Superquadrics-based Grasp Pose Estimation on Larger Objects for Mobile-Manipulation
Xun Tu
Karthik Desingh
178
2
0
07 Nov 2024
DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning
Zhenyu Jiang
Yuqi Xie
K. Lin
Zhenjia Xu
Weikang Wan
Ajay Mandlekar
Linxi Fan
Yuke Zhu
127
33
0
31 Oct 2024
Local Policies Enable Zero-shot Long-horizon Manipulation
Murtaza Dalal
Min Liu
Walter Talbott
Chen Chen
Deepak Pathak
Jian Zhang
Ruslan Salakhutdinov
146
4
0
29 Oct 2024
On-Robot Reinforcement Learning with Goal-Contrastive Rewards
Ondrej Biza
Thomas Weng
Lingfeng Sun
Karl Schmeckpeper
Tarik Kelestemur
Yecheng Jason Ma
Robert Platt
Jan-Willem van de Meent
Lawson L. S. Wong
OffRL
157
0
0
25 Oct 2024
BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering
Jiayue Dai
Yunya Wang
Yihan Fang
Yuetong Chen
Butian Xiong
VLM
68
0
0
19 Oct 2024
Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
Chenhang Cui
An Zhang
Yiyang Zhou
Zhaorun Chen
Gelei Deng
Huaxiu Yao
Tat-Seng Chua
232
8
0
18 Oct 2024
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Guangda Ji
Silvan Weder
Francis Engelmann
Marc Pollefeys
Hermann Blum
3DV
153
4
0
17 Oct 2024
In-Context Learning Enables Robot Action Prediction in LLMs
Yida Yin
Zekai Wang
Yuvan Sharma
Dantong Niu
Trevor Darrell
Roei Herzig
LM&Ro
293
4
0
16 Oct 2024
Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation
Zhijie Yan
Shufei Li
Ziyi Wang
Lixiu Wu
Han Wang
Jun Zhu
Lijiang Chen
Jihong Liu
161
5
0
15 Oct 2024
Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning
Yunpeng Gao
Zhigang Wang
Linglin Jing
Dong Wang
Xuelong Li
Bin Zhao
130
14
0
11 Oct 2024