2104.12763
Cited By
v1
v2 (latest)
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 616 papers shown
Title
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
Yu Wang
Xiangbo Su
Qiang Chen
Xinyu Zhang
Teng Xi
Kun Yao
Errui Ding
Gang Zhang
Jingdong Wang
ObjD
VLM
60
1
0
15 Jul 2024
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Rong Quan
Yantao Lai
Mengyu Qiu
Dong Liang
ViT
71
0
0
15 Jul 2024
Plain-Det: A Plain Multi-Dataset Object Detector
Cheng Shi
Yuchen Zhu
Sibei Yang
ObjD
VLM
89
2
0
14 Jul 2024
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Seitaro Otsuki
T. Iida
Félix Doublet
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
FAtt
118
4
0
12 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
136
5
0
12 Jul 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Haiwen Diao
Bo Wan
Xu Jia
Yunzhi Zhuge
Ying Zhang
Huchuan Lu
Long Chen
VLM
95
4
0
10 Jul 2024
ActionVOS: Actions as Prompts for Video Object Segmentation
Liangyang Ouyang
Ruicong Liu
Yifei Huang
Ryosuke Furuta
Yoichi Sato
VOS
79
2
0
10 Jul 2024
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Yang Liu
Weixing Chen
Yongjie Bai
Xiaodan Liang
Guanbin Li
Wen Gao
Liang Lin
LM&Ro
SyDa
AI4CE
161
70
0
09 Jul 2024
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
84
20
0
08 Jul 2024
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
115
2
0
08 Jul 2024
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang
Jiaqi Hu
Lianrui Mu
Rui Hu
Xiaoyu Liang
Jiangnan Ye
Haoji Hu
CLIP
VLM
104
4
0
08 Jul 2024
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Pengkun Jiao
Na Zhao
Jingjing Chen
Yu-Gang Jiang
VLM
ObjD
72
3
0
07 Jul 2024
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
D. M. Nguyen
An T. Le
Trung Q. Nguyen
Nghiem Tuong Diep
Tai Nguyen
D. Duong-Tran
Jan Peters
Li Shen
Mathias Niepert
Daniel Sonntag
VLM
85
2
0
05 Jul 2024
VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation
I-Chun Arthur Liu
Sicheng He
Daniel Seita
Gaurav Sukhatme
LM&Ro
93
13
0
04 Jul 2024
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
107
4
0
02 Jul 2024
Camera-LiDAR Cross-modality Gait Recognition
Wenxuan Guo
Yingping Liang
Zhiyu Pan
Ziheng Xi
Jianjiang Feng
Jie Zhou
CVBM
95
4
0
02 Jul 2024
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
Hailiang Zhang
Dian Chao
Zhihao Guan
Yang Yang
78
0
0
02 Jul 2024
Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Takayuki Nishimura
Katsuyuki Kuyo
Motonari Kambara
Komei Sugiura
DiffM
98
0
0
01 Jul 2024
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Yicheng Chen
Xiangtai Li
Yining Li
Yanhong Zeng
Jianzong Wu
Xiangyu Zhao
Kai Chen
VLM
DiffM
162
3
0
28 Jun 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
196
31
0
28 Jun 2024
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models
Georgios Tziafas
Hamidreza Kasaei
KELM
LM&Ro
98
9
0
26 Jun 2024
Towards Open-World Grasping with Large Vision-Language Models
Georgios Tziafas
Hamidreza Kasaei
LM&Ro
LRM
122
15
0
26 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
105
9
0
26 Jun 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
103
9
0
24 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&Ro
VLM
110
27
0
17 Jun 2024
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Daizong Liu
Yang Liu
Wencan Huang
Wei Hu
LM&Ro
115
9
0
09 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
133
14
0
08 Jun 2024
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao
Alexandros Graikos
Jingwei Zhang
Sounak Mondal
Minh Hoai
Dimitris Samaras
149
0
0
04 Jun 2024
Multi-layer Learnable Attention Mask for Multimodal Tasks
Wayner Barrios
SouYoung Jin
73
1
0
04 Jun 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets
Maryam Hosseini
Marco Cipriano
Sedigheh Eslami
Daniel Hodczak
Liu Liu
Andres Sevtsuk
Gerard de Melo
67
0
0
03 Jun 2024
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation
Danni Yang
Jiayi Ji
Yiwei Ma
Tianyu Guo
Haowei Wang
Xiaoshuai Sun
Rongrong Ji
ISeg
VLM
97
6
0
03 Jun 2024
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Yang Cao
Yihan Zeng
Hang Xu
Dan Xu
3DPC
ObjD
80
6
0
02 Jun 2024
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen
Han Zhang
Zhantao Yang
Hao Chen
Kai Hu
Marios Savvides
ObjD
VLM
86
5
0
30 May 2024
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang
Mengxue Qu
Jyoti Kini
Yunchao Wei
Mubarak Shah
Yan Yan
LM&Ro
3DPC
110
10
0
28 May 2024
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
Haoyu Zhao
Wenhang Ge
Ying-Cong Chen
ObjD
MLLM
VLM
88
5
0
27 May 2024
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung
Pang-Chi Lo
Jia-Fong Yeh
Han-Yuan Hsu
Yi-Ting Chen
Winston H. Hsu
159
0
0
26 May 2024
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman
Rajat Chawla
Muskaan Kumar
Arkajit Datta
Adarsh Jha
NS Mukunda
Ishaan Bhola
113
3
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
335
54
0
23 May 2024
Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu
Shuqiu Ge
Jie Qin
Gangshan Wu
Limin Wang
ObjD
75
7
0
17 May 2024
Grounded 3D-LLM with Referent Tokens
Yilun Chen
Shuai Yang
Haifeng Huang
Tai Wang
Ruiyuan Lyu
Runsen Xu
Dahua Lin
Jiangmiao Pang
110
37
0
16 May 2024
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Tianhe Ren
Qing Jiang
Shilong Liu
Zhaoyang Zeng
Wenlong Liu
...
Hao Zhang
Feng Li
Peijun Tang
Kent Yu
Lei Zhang
ObjD
VLM
133
38
0
16 May 2024
Spatial Semantic Recurrent Mining for Referring Image Segmentation
Jiaxing Yang
Lihe Zhang
Jiayu Sun
Huchuan Lu
96
0
0
15 May 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
Boris Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
72
19
0
06 May 2024
ScrewMimic: Bimanual Imitation from Human Videos with Screw Space Projection
Arpit Bahety
Priyanka Mandikal
Ben Abbatematteo
Roberto Martín-Martín
111
15
0
06 May 2024
Transcrib3D: 3D Referring Expression Resolution through Large Language Models
Jiading Fang
Xiangshan Tan
Shengjie Lin
Igor Vasiljevic
Vitor Campagnolo Guizilini
Hongyuan Mei
Rares Andrei Ambrus
Gregory Shakhnarovich
Matthew R. Walter
LM&Ro
72
4
0
30 Apr 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
76
1
0
29 Apr 2024
Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Michal Nazarczuk
Jan Kristof Behrens
Karla Stepanova
Matej Hoffmann
K. Mikolajczyk
LM&Ro
LRM
134
1
0
23 Apr 2024
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Georgios Pantazopoulos
Alessandro Suglia
Oliver Lemon
Arash Eshghi
VLM
68
4
0
21 Apr 2024
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
137
12
0
20 Apr 2024
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma
Yi Jiang
Jiannan Wu
Zehuan Yuan
Xiaojuan Qi
VLM
ObjD
113
65
0
19 Apr 2024