Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.05499
Cited By
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
9 March 2023
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
Jie-jin Yang
Chun-yue Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
50 / 1,338 papers shown
Title
Parameter-Efficient Fine-Tuning for Foundation Models
Dan Zhang
Tao Feng
Lilong Xue
Yuandong Wang
Yuxiao Dong
J. Tang
51
9
0
23 Jan 2025
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong
Meng Lan
Qian Zhang
Lefei Zhang
VOS
VGen
73
1
0
23 Jan 2025
DynamicEarth: How Far are We from Open-Vocabulary Change Detection?
Kaiyu Li
Xiangyong Cao
Yupeng Deng
Chao Pang
Zepeng Xin
Deyu Meng
Zhi Wang
ObjD
75
1
0
22 Jan 2025
Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa
Kensho Hara
Hirokatsu Kataoka
Toru Tamaki
43
0
0
22 Jan 2025
ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality
Yanming Xiu
T. Scargill
M. Gorlatova
75
2
0
22 Jan 2025
ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions
Shiyue Zhang
Zheng Chong
Xi Lu
Wenqing Zhang
Haoxiang Li
Xujie Zhang
Jiehui Huang
Xiao Dong
Xiaodan Liang
DiffM
46
0
0
21 Jan 2025
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Zhenhailong Wang
Haiyang Xu
Junyang Wang
Xi Zhang
Ming Yan
J. Zhang
Fei Huang
Heng Ji
47
10
0
20 Jan 2025
Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
Michael Schwingshackl
Fabio Francisco Oberweger
Markus Murschitz
48
1
0
20 Jan 2025
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
Ruixuan Zhang
Beichen Wang
Juexiao Zhang
Zilin Bian
Chen Feng
K. Ozbay
45
3
0
17 Jan 2025
Enhancing Novel Object Detection via Cooperative Foundational Models
Rohit K Bharadwaj
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
ObjD
VLM
164
1
0
17 Jan 2025
Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM
Xin Hu
Janet Wang
Jihun Hamm
R. Yotsu
Zhengming Ding
103
0
0
17 Jan 2025
VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance
Divyansh Srivastava
Beatriz Cabrero-Daniel
Christian Berger
VLM
67
8
0
17 Jan 2025
Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites
Abdalwhab Abdalwhab
A. Imran
Sina Heydarian
I. Iordanova
David St-Onge
54
0
0
16 Jan 2025
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Varun Biyyala
Bharat Chanderprakash Kathuria
Jialu Li
Youshan Zhang
55
0
0
13 Jan 2025
Guided SAM: Label-Efficient Part Segmentation
S.B. van Rooij
G.J. Burghouts
VLM
48
0
0
13 Jan 2025
Toward Realistic Camouflaged Object Detection: Benchmarks and Method
Zhimeng Xin
Tianxu Wu
Shiming Chen
Shuo Ye
Zijing Xie
Yixiong Zou
Xinge You
Yufei Guo
33
0
0
13 Jan 2025
Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning
Juntao Ren
Priya Sundaresan
Dorsa Sadigh
Sanjiban Choudhury
Jeannette Bohg
50
15
0
13 Jan 2025
Static Segmentation by Tracking: A Frustratingly Label-Efficient Approach to Fine-Grained Segmentation
Zhenyang Feng
Zihe Wang
Saul Ibaven Bueno
Tomasz Frelek
Advikaa Ramesh
...
Hilmar Lapp
Charles V. Stewart
T. Berger-Wolf
Yu-Chuan Su
Wei-Lun Chao
53
0
0
12 Jan 2025
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai
Jian Li
Jiedong Zhuang
Xian Zhang
Wankou Yang
ObjD
44
1
0
12 Jan 2025
Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Yuwei Fang
Kwot Sin Lee
Ivan Skorokhodov
Kfir Aberman
Jun-Yan Zhu
Ming Yang
Sergey Tulyakov
DiffM
VGen
69
7
0
10 Jan 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Yuzhou Huang
Ziyang Yuan
Quande Liu
Qiulin Wang
Xintao Wang
Ruimao Zhang
Pengfei Wan
Di Zhang
Kun Gai
VGen
DiffM
47
10
0
08 Jan 2025
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan
Jiyao Zhang
Tianshu Wu
Yinghao Zhao
Wenlong Gao
Hao Dong
LM&Ro
61
8
0
08 Jan 2025
Feedback-Driven Vision-Language Alignment with Minimal Human Supervision
Giorgio Giannone
Ruoteng Li
Qianli Feng
Evgeny Perevodchikov
Rui Chen
Aleix M. Martinez
VLM
66
0
0
08 Jan 2025
Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting
K. Gao
Liangzhi Li
Hongjie He
Dening Lu
Linlin Xu
Jonathan Li
GP
3DGS
32
2
0
08 Jan 2025
ORGANA: A Robotic Assistant for Automated Chemistry Experimentation and Characterization
Kourosh Darvish
Marta Skreta
Yuchi Zhao
Naruki Yoshikawa
Sagnik Som
...
Han Hao
Haoping Xu
Alán Aspuru-Guzik
Animesh Garg
Florian Shkurti
59
23
0
08 Jan 2025
Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
Dongmin Park
Sebin Kim
Taehong Moon
Minkyu Kim
Kangwook Lee
Jaewoong Cho
DiffM
CoGe
70
2
0
08 Jan 2025
Dr. Tongue: Sign-Oriented Multi-label Detection for Remote Tongue Diagnosis
Yiliang Chen
Steven SC Ho
Cheng Xu
Yao Jie Xie
Wing-Fai Yeung
Shengfeng He
Jing Qin
LM&MA
30
0
0
06 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
88
12
0
06 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
75
3
0
03 Jan 2025
Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models
Yifan Zhang
Junhui Hou
66
1
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
102
48
0
03 Jan 2025
VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
Zhipeng Chen
Lan Yang
Yonggang Qi
Honggang Zhang
Kaiyue Pang
Ke Li
Yi-Zhe Song
DiffM
95
0
0
31 Dec 2024
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
80
5
0
31 Dec 2024
YOLO-UniOW: Efficient Universal Open-World Object Detection
Lihao Liu
Juexiao Feng
Hui Chen
Ao Wang
Lin Song
J. Han
Guiguang Ding
ObjD
VLM
53
2
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
64
39
0
31 Dec 2024
MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation
Haoyu Zheng
Wenqiao Zhang
Zheqi Lv
Yu Zhong
Yang Dai
...
Yongliang Shen
Juncheng Billy Li
Dongping Zhang
Siliang Tang
Yueting Zhuang
DiffM
VGen
60
0
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
61
19
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
58
4
0
31 Dec 2024
AI-Powered Urban Transportation Digital Twin: Methods and Applications
Xuan Di
Yongjie Fu
Mehmet K.Turkcan
Mahshid Ghasemi
Zhaobin Mo
Chengbo Zang
Abhishek Adhikari
Z. Kostić
Gil Zussman
AI4CE
39
0
0
30 Dec 2024
Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
X. Feng
D. Zhang
Shuyan Hu
X. Li
M. Wu
Jie Zhang
Xiaojing Chen
K. Huang
47
0
0
27 Dec 2024
Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
Xiaoyang Liu
Boran Wen
Xinpeng Liu
Zizheng Zhou
Hongwei Fan
Cewu Lu
Lizhuang Ma
Yulong Chen
Yong Li
59
2
0
27 Dec 2024
Visual Prompting with Iterative Refinement for Design Critique Generation
Peitong Duan
Chin-yi Chen
Bjoern Hartmann
Yang Li
81
0
0
22 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro
LLMAG
111
18
0
20 Dec 2024
Towards a Training Free Approach for 3D Scene Editing
Vivek Madhavaram
Shivangana Rawat
Chaitanya Devaguptapu
Charu Sharma
Manohar Kaul
DiffM
79
0
0
17 Dec 2024
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman
Haiwen Feng
M. Black
Dimitrios Tzionas
Victoria Fernandez-Abrevaya
VGen
AI4CE
105
3
0
16 Dec 2024
ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction
Yi Feng
Yu Han
Xijing Zhang
Tanghui Li
Yanting Zhang
Rui Fan
117
3
0
15 Dec 2024
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Shengqiong Wu
Hao Fei
Liangming Pan
William Yang Wang
Shuicheng Yan
Tat-Seng Chua
LRM
80
1
0
15 Dec 2024
Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm
Jinrong Zhang
Penghui Wang
Chunxiao Liu
Wei Liu
D. Jin
Qiong Zhang
Erli Meng
Zhengnan Hu
VLM
77
0
0
14 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
245
0
0
12 Dec 2024
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
Keyi Shen
Jiangwei Yu
Huan Zhang
Yunzhu Li
Yunzhu Li
98
1
0
12 Dec 2024
Previous
1
2
3
...
6
7
8
...
25
26
27
Next