arXiv: 2303.05499
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
9 March 2023
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
Jie Yang
Chunyuan Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
Papers citing "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
50 / 1,337 papers shown
A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability
Jie Zhu
Jirong Zha
Ding Li
Leye Wang
31
0
0
15 May 2025
Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data
Yiwen Liu
Jessica Bader
Jae Myung Kim
DiffM
18
0
0
15 May 2025
Towards Safe Robot Foundation Models Using Inductive Biases
Maximilian Tölle
Theo Gruner
Daniel Palenicek
Tim Schneider
Jonas Günster
Joe Watson
Davide Tateo
Puze Liu
Jan Peters
OffRL
AI4CE
27
0
0
15 May 2025
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
Enyu Zhao
Vedant Raval
Hejia Zhang
Jiageng Mao
Zeyu Shangguan
Stefanos Nikolaidis
Yishuo Wang
Daniel Seita
LM&Ro
CoGe
48
0
0
14 May 2025
Air-Ground Collaboration for Language-Specified Missions in Unknown Environments
Fernando Cladera
Zachary Ravichandran
Jason Hughes
Varun Murali
Carlos Nieto-Granda
M. Hsieh
George J. Pappas
Camillo J Taylor
Vijay Kumar
39
1
0
14 May 2025
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Zhaochen Su
Linjie Li
Mingyang Song
Yunzhuo Hao
Zhengyuan Yang
...
Guanjie Chen
Jiawei Gu
Juntao Li
Xiaoye Qu
Yu Cheng
OffRL
LRM
31
0
0
13 May 2025
Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness
Reihaneh Mirjalili
Tobias Jülg
Florian Walter
Wolfram Burgard
32
0
0
13 May 2025
BETTY Dataset: A Multi-modal Dataset for Full-Stack Autonomy
Micah Nye
Ayoub Raji
Andrew Saba
Eidan Erlich
Robert Exley
...
Ritesh Misra
Matthew Sivaprakasam
Marko Bertogna
Deva Ramanan
Sebastian A. Scherer
49
0
0
12 May 2025
The First WARA Robotics Mobile Manipulation Challenge -- Lessons Learned
David Cáceres-Domínguez
M. Iannotta
Abhishek Kashyap
Shuo Sun
Yuxuan Yang
...
Zheng Jia
Graziano Carriero
Sofia Lindqvist
Silvio Di Castro
Matteo Iovino
33
0
0
11 May 2025
Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence
Yu Qiao
Huy Q. Le
Avi Deb Raha
Phuong-Nam Tran
Apurba Adhikary
Mengchun Zhang
Loc X. Nguyen
Eui-nam Huh
Dusit Niyato
Choong Seon Hong
AI4CE
31
0
0
11 May 2025
UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms
Xueyang Guo
Hongwei Hu
Chengye Song
J. Chen
Zilin Zhao
Yu Fu
Bowen Guan
Zhenze Liu
31
0
0
11 May 2025
Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation
Wenwen Qiang
Jianqi Zhang
Jingyao Wang
Changwen Zheng
VLM
37
0
0
10 May 2025
Describe Anything in Medical Images
Xi Xiao
Yunbei Zhang
Thanh-Huy Nguyen
Ba Thinh Lam
Janet Wang
...
Xingjian Li
Xiaobei Wang
Hao Xu
Tianming Liu
Min Xu
MedIm
VLM
49
0
0
09 May 2025
Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization
Zhuang Qi
Sijin Zhou
Lei Meng
Han Hu
Han Yu
Xiangxu Meng
FedML
CML
175
0
0
08 May 2025
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory
Weichen Zhang
Chen Gao
Shiquan Yu
Ruiying Peng
Baining Zhao
Qian Zhang
Jinqiang Cui
Xinlei Chen
Yong Li
LLMAG
LM&Ro
49
0
0
08 May 2025
EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Biao Yi
Xavier Hu
Y. Chen
Shengyu Zhang
Hongxia Yang
Fan Wu
Fei Wu
LLMAG
199
0
0
08 May 2025
Visual Affordances: Enabling Robots to Understand Object Functionality
Tommaso Apicella
Alessio Xompero
Andrea Cavallaro
46
0
0
08 May 2025
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Teng Hu
Zhentao Yu
Zhengguang Zhou
Sen Liang
Yuan Zhou
Qin Lin
Qinglin Lu
DiffM
VGen
57
0
0
07 May 2025
Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Divyansh Srivastava
Xiang Zhang
He Wen
Chenru Wen
Zhuowen Tu
DiffM
39
0
0
07 May 2025
Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Mishal Fatima
Steffen Jung
M. Keuper
45
0
0
06 May 2025
From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection
Guoting Wei
Yu Liu
Xia Yuan
Xizhe Xue
Linlin Guo
Yifan Yang
Chunxia Zhao
Zongwen Bai
Haokui Zhang
Rong Xiao
ObjD
53
0
0
06 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Xuzhi Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
D. Jiang
Mengmeng Wang
Liuzhuozheng Li
Lei Zhang
Haoyu Wang
Wei Wei
Guang Dai
Yanning Zhang
Jingdong Wang
DiffM
51
0
0
05 May 2025
6D Pose Estimation on Spoons and Hands
Kevin Tan
Fan Yang
Yuxiao Chen
47
0
0
05 May 2025
CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Xiaoqi Li
Lingyun Xu
Hao Fei
Jiaming Liu
Yan Shen
...
Jiahui Xu
Liang Heng
Siyuan Huang
Shanghang Zhang
Hao Dong
LM&Ro
54
0
0
04 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
36
0
0
04 May 2025
LLM-Guided Probabilistic Program Induction for POMDP Model Estimation
Aidan Curtis
Hao Tang
Thiago Veloso
Kevin Ellis
Joshua B. Tenenbaum
Tomás Lozano-Pérez
Leslie Pack Kaelbling
109
0
0
04 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
68
0
0
03 May 2025
Robotic Visual Instruction
Y. Li
Ziyang Gong
Yiming Li
Xiaoqi Huang
Haolan Kang
Guangping Bai
Xianzheng Ma
LM&Ro
76
0
0
01 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
A. Yuille
Jieneng Chen
LRM
65
1
0
01 May 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
D. Jiang
Ziyu Guo
Renrui Zhang
Zhuofan Zong
Hao Li
Le Zhuo
Shilin Yan
Pheng-Ann Heng
Yiming Li
LRM
72
2
0
01 May 2025
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
Trilok Padhi
R. Kaur
Adam D. Cobb
Manoj Acharya
Anirban Roy
Colin Samplawski
Brian Matejek
Alexander M. Berenbeim
Nathaniel D. Bastian
Susmit Jha
28
0
0
30 Apr 2025
OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen
Yifang Yin
Jinming Cao
Shili Xiang
Zhenguang Liu
Roger Zimmermann
VOS
VLM
48
0
0
30 Apr 2025
Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision
Weicai Yan
Wang Lin
Zirun Guo
Ye Wang
Fangming Feng
Xiaoda Yang
Zihan Wang
Tao Jin
DiffM
150
2
0
30 Apr 2025
Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Daniel Bogdoll
Rajanikant Ananta
Abeyankar Giridharan
Isabel Moore
Gregory Stevens
Henry X. Liu
VLM
56
0
0
30 Apr 2025
UAV-VLN: End-to-End Vision Language guided Navigation for UAVs
Pranav Saxena
Nishant Raghuvanshi
Neena Goveas
77
0
0
30 Apr 2025
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Jiaxu Qian
Chendong Wang
Yuqing Yang
Chaoyun Zhang
Huiqiang Jiang
...
Saravan Rajmohan
Dongmei Zhang
Yifan Yang
Qi Zhang
Lili Qiu
VLM
84
0
0
30 Apr 2025
XeMap: Contextual Referring in Large-Scale Remote Sensing Environments
Yong Li
Lu Si
Y. T. Hou
Chengaung Liu
Yangqiu Song
Hongjian Fang
Jun Zhang
82
0
0
30 Apr 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
88
0
0
30 Apr 2025
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Yuanchen Wu
Lu Zhang
Hang Yao
Junlong Du
Ke Yan
Shouhong Ding
Yunsheng Wu
Xuzhao Li
MLLM
71
0
0
29 Apr 2025
Simultaneous Pick and Place Detection by Combining SE(3) Diffusion Models with Differential Kinematics
Tianyi Ko
Takuya Ikeda
Koichi Nishiwaki
40
0
0
28 Apr 2025
Explaining Vision GNNs: A Semantic and Visual Analysis of Graph-based Image Classification
Nikolaos Chaidos
Angeliki Dimitriou
Nikolaos Spanos
Athanasios Voulodimos
Giorgos Stamou
43
1
0
28 Apr 2025
If Concept Bottlenecks are the Question, are Foundation Models the Answer?
Nicola Debole
Pietro Barbiero
Francesco Giannini
Andrea Passerini
Stefano Teso
Emanuele Marconato
167
0
0
28 Apr 2025
TransparentGS: Fast Inverse Rendering of Transparent Objects with Gaussians
Letian Huang
Dongwei Ye
Jialin Dan
Chengzhi Tao
Huiwen Liu
Kun Zhou
Bo Ren
Yongbin Li
Yanwen Guo
Jie Guo
50
1
0
26 Apr 2025
Generative AI in Embodied Systems: System-Level Analysis of Performance, Efficiency and Scalability
Zishen Wan
Jiayi Qian
Yuhang Du
Jason J. Jabbour
Yilun Du
Yang Katie Zhao
A. Raychowdhury
Tushar Krishna
Vijay Janapa Reddi
LM&Ro
91
0
0
26 Apr 2025
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Phillip Y. Lee
Jihyeon Je
Chanho Park
Mikaela Angelina Uy
Leonidas J. Guibas
Minhyuk Sung
LRM
46
0
0
24 Apr 2025
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
Yiming Zhao
Guorong Li
Laiyun Qing
Amin Beheshti
Jian Yang
Michael Sheng
Yuankai Qi
Qingming Huang
VLM
VPVLM
75
0
0
24 Apr 2025
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Xinyu Chen
Yunxin Li
Haoyuan Shi
Baotian Hu
Wenhan Luo
Yaowei Wang
Hao Fei
ELM
67
0
0
23 Apr 2025
MorphoNavi: Aerial-Ground Robot Navigation with Object Oriented Mapping in Digital Twin
Sausar Karaf
Mikhail Martynov
Oleg Sautenkov
Zhanibek Darush
Dzmitry Tsetserukou
53
1
0
23 Apr 2025
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Zebin Yao
Lei Ren
Huixing Jiang
Chen Wei
Xiaojie Wang
Ruifan Li
Fangxiang Feng
DiffM
76
0
0
22 Apr 2025