Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2104.12763
Cited By
v1
v2 (latest)
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 616 papers shown
Title
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Zizhao Li
Zhengkang Xiang
Joseph West
Kourosh Khoshelham
ObjD
VLM
187
1
0
27 Nov 2024
Open Vocabulary Monocular 3D Object Detection
Jin Yao
Hao Gu
Xuweiyi Chen
Jiayun Wang
Zezhou Cheng
ObjD
VLM
121
3
0
25 Nov 2024
Leverage Task Context for Object Affordance Ranking
Haojie Huang
Hongchen Luo
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
137
0
0
25 Nov 2024
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
138
2
0
20 Nov 2024
TrojanRobot: Physical-World Backdoor Attacks Against VLM-based Robotic Manipulation
Xiaobei Wang
Hewen Pan
Hangtao Zhang
Minghui Li
Shengshan Hu
...
Peijin Guo
Yichen Wang
Wei Wan
Aishan Liu
L. Zhang
AAML
187
2
0
18 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Zechao Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
74
4
0
14 Nov 2024
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo
Wei Fan
Baichun Wei
Jianfei Zhu
Jin Tian
Chunzhi Yi
Feng Jiang
72
0
0
13 Nov 2024
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers
Yeong-Seung Baek
Heung-Seon Oh
66
0
0
07 Nov 2024
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
Seongsu Ha
Chaeyun Kim
Donghwa Kim
Junho Lee
Sangho Lee
Joonseok Lee
119
4
0
03 Nov 2024
Referring Human Pose and Mask Estimation in the Wild
Bo Miao
Mingtao Feng
Zijie Wu
Mohammed Bennamoun
Yongsheng Gao
Ajmal Mian
88
0
0
27 Oct 2024
Zero-shot Object Navigation with Vision-Language Models Reasoning
Congcong Wen
Yisiyuan Huang
Hao Huang
Yanjia Huang
Shuaihang Yuan
Yu Hao
Hui Lin
Yu-Shen Liu
Yi Fang
LM&Ro
134
10
0
24 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
92
1
0
21 Oct 2024
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya
Masanori Suganuma
Takayuki Okatani
ObjD
92
0
0
20 Oct 2024
Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation
Changcheng Xiao
Qiong Cao
Yujie Zhong
Xiang Zhang
Tao Wang
Canqun Yang
L. Lan
70
0
0
17 Oct 2024
Context-Infused Visual Grounding for Art
Selina Khan
Nanne van Noord
ObjD
71
1
0
16 Oct 2024
DINTR: Tracking via Diffusion-based Interpolation
Pha Nguyen
Ngan Le
J. Cothren
Alper Yilmaz
Khoa Luu
DiffM
93
1
0
14 Oct 2024
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
DiffM
84
6
0
14 Oct 2024
DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios
Siyi Jiao
Wenzheng Zeng
Changxin Gao
Nong Sang
60
1
0
13 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
126
7
0
10 Oct 2024
G
2
^{2}
2
TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
62
0
0
10 Oct 2024
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Negar Nejatishahidin
Madhukar Reddy Vongala
Jana Kosecka
90
3
0
09 Oct 2024
Grounding Partially-Defined Events in Multimodal Data
Kate Sanders
Reno Kriz
David Etter
Hannah Recknor
Alexander Martin
Cameron Carpenter
Jingyang Lin
Benjamin Van Durme
61
2
0
07 Oct 2024
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
Mengxue Qu
Xiaodong Chen
Wu Liu
Alicia Li
Yao Zhao
88
18
0
01 Oct 2024
You Only Speak Once to See
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
63
2
0
27 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
125
13
0
26 Sep 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij S. Jadhav
83
0
0
26 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
Xiaohu Yang
Weiwei Li
Peng Wang
ObjD
139
5
0
23 Sep 2024
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
A. Mavrogiannis
Dehao Yuan
Yiannis Aloimonos
LM&Ro
89
0
0
23 Sep 2024
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu
Zunnan Xu
Yue Hu
Liangtao Shi
Zhiqiang Wang
Quanjun Yin
157
3
0
20 Sep 2024
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Amaia Cardiel
Éloi Zablocki
Oriane Siméoni
Elias Ramzi
Matthieu Cord
VLM
77
0
0
18 Sep 2024
Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints
Chen Jiang
Allie Luo
Martin Jägersand
68
1
0
17 Sep 2024
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Haoxuan Wang
Qu He
Jinlong Peng
Hao Yang
Mingmin Chi
Yabiao Wang
Mamba
104
2
0
13 Sep 2024
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
Hanning Chen
Yang Ni
Wenjun Huang
Yezi Liu
SungHeon Jeong
Fei Wen
Nathaniel D. Bastian
Hugo Latapie
Mohsen Imani
VLM
85
4
0
13 Sep 2024
An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection
Pengfei Qi
Yifei Zhang
Wenqiang Li
Youwen Hu
Kunlong Bai
ObjD
83
0
0
10 Sep 2024
Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers
Gorka Abad
S. Picek
Lorenzo Cavallaro
A. Urbieta
SILM
79
0
0
06 Sep 2024
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
Jingcheng Ke
Dele Wang
Jun-Cheng Chen
I-Hong Jhuo
Chia-Wen Lin
Yen-Yu Lin
82
0
0
05 Sep 2024
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
Bingcheng Dong
Yuning Ding
Jinrong Zhang
Sifan Zhang
Shenglan Liu
ObjD
85
0
0
26 Aug 2024
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov
Samir Rustamov
VLM
38
1
0
25 Aug 2024
R2G: Reasoning to Ground in 3D Scenes
Yixuan Li
Zan Wang
Wei Liang
87
2
0
24 Aug 2024
D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models
Matteo Forlini
Mihail Babcinschi
Giacomo Palmieri
Pedro Neto
87
1
0
21 Aug 2024
On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes
Sadia Ilyas
Ido Freeman
Matthias Rottmann
ObjD
107
3
0
20 Aug 2024
Towards Flexible Visual Relationship Segmentation
Fangrui Zhu
Jianwei Yang
Huaizu Jiang
VOS
100
2
0
15 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
102
5
0
02 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
95
2
0
28 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
85
3
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
63
0
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
75
2
0
22 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLM
CoGe
92
1
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
113
7
0
18 Jul 2024
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Yang Zhou
Yongjian Wu
Jiya Saiyin
Bingzheng Wei
Maode Lai
Eric Chang
Yan Xu
VLM
90
1
0
16 Jul 2024
Previous
1
2
3
4
5
...
11
12
13
Next