Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2104.12763
Cited By
v1
v2 (latest)
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 616 papers shown
Title
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Shuquan Ye
Yujia Xie
Dongdong Chen
Yichong Xu
Lu Yuan
Chenguang Zhu
Jing Liao
VLM
66
12
0
29 Nov 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Siyi Liu
Yaoyuan Liang
Feng Li
Shijia Huang
Hao Zhang
Hang Su
Jun Zhu
Lei Zhang
ObjD
105
28
0
28 Nov 2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Jiangyong Huang
William Zhu
Baoxiong Jia
Zan Wang
Xiaojian Ma
Qing Li
Siyuan Huang
123
5
0
28 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Qi Zhang
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD
VLM
66
1
0
28 Nov 2022
Learning Object-Language Alignments for Open-Vocabulary Object Detection
Chuang Lin
Pei Sun
Yi Jiang
Ping Luo
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
Jianfei Cai
VLM
ObjD
80
98
0
27 Nov 2022
PUnifiedNER: A Prompting-based Unified NER System for Diverse Datasets
Jinghui Lu
Rui Zhao
Brian Mac Namee
Fei Tan
91
21
0
27 Nov 2022
Who are you referring to? Coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
102
3
0
26 Nov 2022
Language-Assisted 3D Feature Learning for Semantic Scene Understanding
Junbo Zhang
Guo Fan
Guanghan Wang
Zhèngyuān Sū
Kaisheng Ma
L. Yi
3DPC
78
7
0
25 Nov 2022
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Yuxing Qiu
Feng Gao
Minchen Li
Govind Thattai
Yin Yang
Chenfanfu Jiang
PINN
DiffM
VGen
60
0
0
25 Nov 2022
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
R. Burgert
Kanchana Ranasinghe
Xiang Li
Michael S. Ryoo
DiffM
VLM
90
38
0
23 Nov 2022
X
2
^2
2
-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Hkust Wangchunshu Zhou
VLM
MLLM
63
15
0
22 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
70
4
0
21 Nov 2022
Unifying Tracking and Image-Video Object Detection
Peirong Liu
Rui Wang
Pengchuan Zhang
Omid Poursaeed
Yipin Zhou
Xuefei Cao
Sreya . Dutta Roy
Ashish Shah
Ser-Nam Lim
93
0
0
20 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
78
11
0
20 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
99
88
0
17 Nov 2022
A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation
Shijia Huang
Feng Li
Hao Zhang
Siyi Liu
Lei Zhang
Liwei Wang
68
5
0
15 Nov 2022
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
60
22
0
15 Nov 2022
Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing
Aditay Tripathi
Anand Mishra
Anirban Chakraborty
53
2
0
03 Nov 2022
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
Henghui Ding
Chang Liu
Suchen Wang
Xudong Jiang
138
124
0
28 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
57
6
0
24 Oct 2022
Extending Phrase Grounding with Pronouns in Visual Dialogues
Panzhong Lu
Xin Zhang
Meishan Zhang
Min Zhang
ObjD
74
4
0
23 Oct 2022
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
Jiaming Chen
Weihua Luo
Ran Song
Xiaolin K. Wei
Lin Ma
Wei Emma Zhang
3DV
86
6
0
22 Oct 2022
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Pengfei Li
Beiwen Tian
Yongliang Shi
Xiaoxue Chen
Hao Zhao
Guyue Zhou
Ya Zhang
118
22
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
93
22
0
19 Oct 2022
Perceptual Grouping in Contrastive Vision-Language Models
Kanchana Ranasinghe
Brandon McKinzie
S. S. Ravi
Yinfei Yang
Alexander Toshev
Jonathon Shlens
VLM
142
55
0
18 Oct 2022
How to Train Vision Transformer on Small-scale Datasets?
Hanan Gani
Muzammal Naseer
Mohammad Yaqub
ViT
70
51
0
13 Oct 2022
Visual Classification via Description from Large Language Models
Sachit Menon
Carl Vondrick
VLM
103
303
0
13 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
70
1
0
12 Oct 2022
Visual Language Maps for Robot Navigation
Chen Huang
Oier Mees
Andy Zeng
Wolfram Burgard
LM&Ro
307
372
0
11 Oct 2022
Understanding Embodied Reference with Touch-Line Transformer
Yongqian Li
Xiaoxue Chen
Hao Zhao
Jiangtao Gong
Guyue Zhou
Federico Rossano
Yixin Zhu
172
17
0
11 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
82
33
0
11 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
82
22
0
09 Oct 2022
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Alexander Ku
Austin Waters
Yinfei Yang
Jason Baldridge
Zarana Parekh
LM&Ro
104
48
0
06 Oct 2022
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
88
5
0
06 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
114
19
0
05 Oct 2022
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models
Guangyi Chen
Weiran Yao
Xiangchen Song
Xinyue Li
Yongming Rao
Kun Zhang
VPVLM
VLM
96
62
0
03 Oct 2022
Introducing Vision Transformer for Alzheimer's Disease classification task with 3D input
Zilun Zhang
Farzad Khalvati
MedIm
ViT
47
10
0
03 Oct 2022
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Yanmin Wu
Xinhua Cheng
Renrui Zhang
Zesen Cheng
Jian Zhang
151
70
0
29 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
105
28
0
28 Sep 2022
Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks
Zhiyang Chen
Yousong Zhu
Zhaowen Li
Fan Yang
Wei Li
...
Chaoyang Zhao
Liwei Wu
Rui Zhao
Jinqiao Wang
Ming Tang
VLM
VOS
124
16
0
28 Sep 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
83
34
0
27 Sep 2022
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Tomás Crisol
J. Ermantraut
Adrián Rostagno
Santiago L. Aggio
Javier Iparraguirre
87
0
0
21 Sep 2022
Towards Robust Referring Image Segmentation
Jianzong Wu
Xiangtai Li
Xia Li
Henghui Ding
Yu Tong
Dacheng Tao
3DV
105
45
0
20 Sep 2022
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Lewei Yao
Jianhua Han
Youpeng Wen
Xiaodan Liang
Dan Xu
Wei Zhang
Zhenguo Li
Chunjing Xu
Hang Xu
CLIP
VLM
191
161
0
20 Sep 2022
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
Manli Shu
Weili Nie
De-An Huang
Zhiding Yu
Tom Goldstein
Anima Anandkumar
Chaowei Xiao
VLM
VPVLM
282
309
0
15 Sep 2022
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
304
501
0
12 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLM
MLLM
ObjD
42
5
0
10 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
112
55
0
05 Sep 2022
Injecting Image Details into CLIP's Feature Space
Zilun Zhang
Cuifeng Shen
Yuan-Chung Shen
Huixin Xiong
Xinyu Zhou
VLM
CLIP
62
1
0
31 Aug 2022
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Gongjie Zhang
Zhipeng Luo
Zichen Tian
Yingchen Yu
Jingyi Zhang
Shijian Lu
106
29
0
24 Aug 2022
Previous
1
2
3
...
10
11
12
13
9
Next