arXiv:2104.12763
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 616 papers shown
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
Konstantinos Vilouras
Pedro Sanchez
Alison Q. O'Neil
Sotirios A. Tsaftaris
MedIm
187
3
0
19 Apr 2024
MLS-Track: Multilevel Semantic Interaction in RMOT
Zeliang Ma
Yang Song
Zhe Cui
Zhicheng Zhao
Fei Su
Delong Liu
Jingyu Wang
83
4
0
18 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
91
3
0
15 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM, ObjD
96
26
0
14 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
108
11
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD, MLLM
159
51
0
11 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
77
31
0
11 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM, ObjD
67
7
0
07 Apr 2024
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization
Seung-bum Chung
Joohyun Park
Hyewon Kan
Hyeongyeop Kang
CLIP
77
1
0
03 Apr 2024
Text-driven Affordance Learning from Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
99
6
0
03 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
148
7
0
28 Mar 2024
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda
Hideko Habe
Yoko Matsui
Akishige Yuguchi
Seiya Kawano
Yasutomo Kawanishi
Sadao Kurohashi
Koichiro Yoshino
80
3
0
28 Mar 2024
Online Embedding Multi-Scale CLIP Features into 3D Maps
Shun Taguchi
Hideki Deguchi
50
0
0
27 Mar 2024
ReMamber: Referring Image Segmentation with Mamba Twister
Yu-Hao Yang
Chaofan Ma
Jiangchao Yao
Zhun Zhong
Ya Zhang
Yanfeng Wang
Mamba
108
24
0
26 Mar 2024
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
Ganlong Zhao
Guanbin Li
Weikai Chen
Yizhou Yu
94
5
0
26 Mar 2024
Data-Efficient 3D Visual Grounding via Order-Aware Referring
Tung-Yu Wu
Sheng-Yu Huang
Yu-Chiang Frank Wang
143
0
0
25 Mar 2024
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
Qing Jiang
Feng Li
Zhaoyang Zeng
Tianhe Ren
Shilong Liu
Lei Zhang
VLM
114
44
0
21 Mar 2024
IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
Hang Wang
Zhi-Qi Cheng
Youtian Du
Lei Zhang
64
1
0
18 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD, VLM
86
20
0
15 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
101
13
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
98
14
0
14 Mar 2024
TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
Hanning Chen
Wenjun Huang
Yang Ni
Sanggeon Yun
Fei Wen
Hugo Latapie
Mohsen Imani
ObjD, MLLM, VLM
106
18
0
12 Mar 2024
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
Yuexi Chen
Vlad I. Morariu
Anh Truong
Zhicheng Liu
DiffM, VGen
72
5
0
12 Mar 2024
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yueping Jiang
MLLM
85
20
0
12 Mar 2024
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
Tiancheng Zhao
Peng Liu
Xuan He
Lu Zhang
Kyusong Lee
ObjD
70
8
0
11 Mar 2024
Discriminative Probing and Tuning for Text-to-Image Generation
Leigang Qu
Wenjie Wang
Chak Tou Leong
Hanwang Zhang
Liqiang Nie
Tat-Seng Chua
87
8
0
07 Mar 2024
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen
Vipin Vijayan
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
75
2
0
05 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM, CLIP
168
12
0
05 Mar 2024
RegionGPT: Towards Region Understanding Vision Language Model
Qiushan Guo
Shalini De Mello
Hongxu Yin
Wonmin Byeon
Ka Chun Cheung
Yizhou Yu
Ping Luo
Sifei Liu
VLM
100
37
0
04 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM, ObjD
115
36
0
04 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
146
3
0
04 Mar 2024
Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Huan Ma
Yan Zhu
Changqing Zhang
Peilin Zhao
Baoyuan Wu
Long-Kai Huang
Qinghua Hu
Bing Wu
VLM
161
2
0
01 Mar 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM, VLM
133
47
0
26 Feb 2024
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang
Yisi Zhang
Xingjian He
Yichen Yan
Zijia Zhao
Xinlong Wang
Jing Liu
LM&Ro
81
4
0
17 Feb 2024
Real-World Robot Applications of Foundation Models: A Review
Kento Kawaharazuka
T. Matsushima
Andrew Gambardella
Jiaxian Guo
Chris Paxton
Andy Zeng
OffRL, VLM, LM&Ro
116
54
0
08 Feb 2024
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors
Sheng Jin
Xue-Qiu Jiang
Jiaxing Huang
Lewei Lu
Shijian Lu
VLM, ObjD
91
26
0
07 Feb 2024
Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory
N. H. Chapman
Feras Dayoub
Will N. Browne
Chris Lehnert
ObjD, VLM, LM&Ro
65
1
0
06 Feb 2024
Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection
Hao Li
Wei Wang
Cong Wang
Zhigang Luo
Xinwang Liu
KenLi Li
Xiaochun Cao
ObjD
93
1
0
02 Feb 2024
YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng
Lin Song
Yixiao Ge
Wenyu Liu
Xinggang Wang
Ying Shan
VLM, ObjD
124
300
0
30 Jan 2024
MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Saumya Saxena
Mohit Sharma
Oliver Kroemer
85
4
0
25 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
103
4
0
12 Jan 2024
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
Bowen Shi
Peisen Zhao
Zichen Wang
Yuhang Zhang
Yaoming Wang
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Qi Tian
Xiaopeng Zhang
VLM
63
8
0
12 Jan 2024
GroundingGPT: Language Enhanced Multi-modal Grounding Model
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
130
44
0
11 Jan 2024
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Xiangyu Zhao
Yicheng Chen
Shilin Xu
Xiangtai Li
Xinjiang Wang
Yining Li
Haian Huang
ObjD, AI4CE
102
32
0
04 Jan 2024
Context-Guided Spatio-Temporal Video Grounding
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
100
16
0
03 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
163
8
0
03 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD, VLM
121
6
0
29 Dec 2023
Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation
Jiaxi Wang
Wenhui Hu
Xueyang Liu
Beihu Wu
Yuting Qiu
Yingying Cai
49
1
0
29 Dec 2023
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Yifan Lu
Ziqi Zhang
Chunfen Yuan
Peng Li
Yan Wang
Bing Li
Weiming Hu
74
5
0
25 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
106
18
0
25 Dec 2023