ResearchTrend.AI
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjD, VLM
ArXiv (abs) · PDF · HTML · GitHub (1008★)

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 616 papers shown
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Zizhao Li
Zhengkang Xiang
Joseph West
Kourosh Khoshelham
ObjD, VLM
187
1
0
27 Nov 2024
Open Vocabulary Monocular 3D Object Detection
Jin Yao
Hao Gu
Xuweiyi Chen
Jiayun Wang
Zezhou Cheng
ObjD, VLM
121
3
0
25 Nov 2024
Leverage Task Context for Object Affordance Ranking
Haojie Huang
Hongchen Luo
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
137
0
0
25 Nov 2024
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM, LRM
138
2
0
20 Nov 2024
TrojanRobot: Physical-World Backdoor Attacks Against VLM-based Robotic Manipulation
Xiaobei Wang
Hewen Pan
Hangtao Zhang
Minghui Li
Shengshan Hu
...
Peijin Guo
Yichen Wang
Wei Wan
Aishan Liu
L. Zhang
AAML
187
2
0
18 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Zechao Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
74
4
0
14 Nov 2024
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo
Wei Fan
Baichun Wei
Jianfei Zhu
Jin Tian
Chunzhi Yi
Feng Jiang
72
0
0
13 Nov 2024
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers
Yeong-Seung Baek
Heung-Seon Oh
66
0
0
07 Nov 2024
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
Seongsu Ha
Chaeyun Kim
Donghwa Kim
Junho Lee
Sangho Lee
Joonseok Lee
119
4
0
03 Nov 2024
Referring Human Pose and Mask Estimation in the Wild
Bo Miao
Mingtao Feng
Zijie Wu
Mohammed Bennamoun
Yongsheng Gao
Ajmal Mian
88
0
0
27 Oct 2024
Zero-shot Object Navigation with Vision-Language Models Reasoning
Congcong Wen
Yisiyuan Huang
Hao Huang
Yanjia Huang
Shuaihang Yuan
Yu Hao
Hui Lin
Yu-Shen Liu
Yi Fang
LM&Ro
134
10
0
24 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
92
1
0
21 Oct 2024
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya
Masanori Suganuma
Takayuki Okatani
ObjD
92
0
0
20 Oct 2024
Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation
Changcheng Xiao
Qiong Cao
Yujie Zhong
Xiang Zhang
Tao Wang
Canqun Yang
L. Lan
70
0
0
17 Oct 2024
Context-Infused Visual Grounding for Art
Selina Khan
Nanne van Noord
ObjD
71
1
0
16 Oct 2024
DINTR: Tracking via Diffusion-based Interpolation
Pha Nguyen
Ngan Le
J. Cothren
Alper Yilmaz
Khoa Luu
DiffM
93
1
0
14 Oct 2024
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
DiffM
84
6
0
14 Oct 2024
DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios
Siyi Jiao
Wenzheng Zeng
Changxin Gao
Nong Sang
60
1
0
13 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
126
7
0
10 Oct 2024
G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
62
0
0
10 Oct 2024
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Negar Nejatishahidin
Madhukar Reddy Vongala
Jana Kosecka
90
3
0
09 Oct 2024
Grounding Partially-Defined Events in Multimodal Data
Kate Sanders
Reno Kriz
David Etter
Hannah Recknor
Alexander Martin
Cameron Carpenter
Jingyang Lin
Benjamin Van Durme
61
2
0
07 Oct 2024
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
Mengxue Qu
Xiaodong Chen
Wu Liu
Alicia Li
Yao Zhao
88
18
0
01 Oct 2024
You Only Speak Once to See
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
63
2
0
27 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
125
13
0
26 Sep 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij S. Jadhav
83
0
0
26 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
Xiaohu Yang
Weiwei Li
Peng Wang
ObjD
139
5
0
23 Sep 2024
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
A. Mavrogiannis
Dehao Yuan
Yiannis Aloimonos
LM&Ro
89
0
0
23 Sep 2024
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu
Zunnan Xu
Yue Hu
Liangtao Shi
Zhiqiang Wang
Quanjun Yin
157
3
0
20 Sep 2024
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Amaia Cardiel
Éloi Zablocki
Oriane Siméoni
Elias Ramzi
Matthieu Cord
VLM
77
0
0
18 Sep 2024
Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints
Chen Jiang
Allie Luo
Martin Jägersand
68
1
0
17 Sep 2024
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Haoxuan Wang
Qu He
Jinlong Peng
Hao Yang
Mingmin Chi
Yabiao Wang
Mamba
104
2
0
13 Sep 2024
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
Hanning Chen
Yang Ni
Wenjun Huang
Yezi Liu
SungHeon Jeong
Fei Wen
Nathaniel D. Bastian
Hugo Latapie
Mohsen Imani
VLM
85
4
0
13 Sep 2024
An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection
Pengfei Qi
Yifei Zhang
Wenqiang Li
Youwen Hu
Kunlong Bai
ObjD
83
0
0
10 Sep 2024
Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers
Gorka Abad
S. Picek
Lorenzo Cavallaro
A. Urbieta
SILM
79
0
0
06 Sep 2024
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
Jingcheng Ke
Dele Wang
Jun-Cheng Chen
I-Hong Jhuo
Chia-Wen Lin
Yen-Yu Lin
82
0
0
05 Sep 2024
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
Bingcheng Dong
Yuning Ding
Jinrong Zhang
Sifan Zhang
Shenglan Liu
ObjD
85
0
0
26 Aug 2024
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov
Samir Rustamov
VLM
38
1
0
25 Aug 2024
R2G: Reasoning to Ground in 3D Scenes
Yixuan Li
Zan Wang
Wei Liang
87
2
0
24 Aug 2024
D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models
Matteo Forlini
Mihail Babcinschi
Giacomo Palmieri
Pedro Neto
87
1
0
21 Aug 2024
On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes
Sadia Ilyas
Ido Freeman
Matthias Rottmann
ObjD
107
3
0
20 Aug 2024
Towards Flexible Visual Relationship Segmentation
Fangrui Zhu
Jianwei Yang
Huaizu Jiang
VOS
100
2
0
15 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
102
5
0
02 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
95
2
0
28 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
85
3
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
63
0
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
75
2
0
22 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLM, CoGe
92
1
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
113
7
0
18 Jul 2024
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Yang Zhou
Yongjian Wu
Jiya Saiyin
Bingzheng Wei
Maode Lai
Eric Chang
Yan Xu
VLM
90
1
0
16 Jul 2024