ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.06230
  4. Cited By
Simple Open-Vocabulary Object Detection with Vision Transformers

Simple Open-Vocabulary Object Detection with Vision Transformers

12 May 2022
Matthias Minderer
A. Gritsenko
Austin Stone
Maxim Neumann
Dirk Weissenborn
Alexey Dosovitskiy
Aravindh Mahendran
Anurag Arnab
Mostafa Dehghani
Zhuoran Shen
Tianlin Li
Xiaohua Zhai
Thomas Kipf
N. Houlsby
    ObjD
    CLIP
    VLM
    ViT
    OCL
ArXivPDFHTML

Papers citing "Simple Open-Vocabulary Object Detection with Vision Transformers"

50 / 247 papers shown
Title
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating
  the Generalizability of Video Question Answering Models
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
Dohwan Ko
Ji Soo Lee
M. Choi
Jaewon Chu
Jihwan Park
Hyunwoo J. Kim
22
5
0
18 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
37
12
0
11 Aug 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
  Control
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng-Tao Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
LM&Ro
LRM
30
1,100
0
28 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
OBJECT 3DIT: Language-guided 3D-aware Image Editing
OBJECT 3DIT: Language-guided 3D-aware Image Editing
Oscar Michel
Anand Bhattad
Eli VanderBilt
Ranjay Krishna
Aniruddha Kembhavi
Tanmay Gupta
DiffM
30
39
0
20 Jul 2023
Distilling Knowledge from Text-to-Image Generative Models Improves
  Visio-Linguistic Reasoning in CLIP
Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP
S. Basu
S. Hu
Maziar Sanjabi
Daniela Massiceti
S. Feizi
VLM
21
3
0
18 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present,
  and Future
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
31
32
0
18 Jul 2023
Unified Open-Vocabulary Dense Visual Prediction
Unified Open-Vocabulary Dense Visual Prediction
Hengcan Shi
Munawar Hayat
Jianfei Cai
ObjD
VLM
43
19
0
17 Jul 2023
Sim2Plan: Robot Motion Planning via Message Passing between Simulation
  and Reality
Sim2Plan: Robot Motion Planning via Message Passing between Simulation and Reality
Yizhou Zhao
Yuanhong Zeng
Qiang Long
Ying Nian Wu
Song-Chun Zhu
35
0
0
15 Jul 2023
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and
  Resolution
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Mostafa Dehghani
Basil Mustafa
Josip Djolonga
Jonathan Heek
Matthias Minderer
...
Avital Oliver
Piotr Padlewski
A. Gritsenko
Mario Luvcić
N. Houlsby
ViT
23
105
0
12 Jul 2023
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with
  Language Models
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang
Chen Wang
Ruohan Zhang
Yunzhu Li
Jiajun Wu
Li Fei-Fei
LM&Ro
33
480
0
12 Jul 2023
RoCo: Dialectic Multi-Robot Collaboration with Large Language Models
RoCo: Dialectic Multi-Robot Collaboration with Large Language Models
Zhao Mandi
Shreeya Jain
Shuran Song
LM&Ro
LLMAG
31
125
0
10 Jul 2023
Robots That Ask For Help: Uncertainty Alignment for Large Language Model
  Planners
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
Allen Z. Ren
Anushri Dixit
Alexandra Bodrova
Sumeet Singh
Stephen Tu
...
Jacob Varley
Zhenjia Xu
Dorsa Sadigh
Andy Zeng
Anirudha Majumdar
LM&Ro
55
219
0
04 Jul 2023
RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation
  based on Visual Foundation Model
RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model
Keyan Chen
Chenyang Liu
Hao Chen
Haotian Zhang
Wenyuan Li
Zhengxia Zou
Z. Shi
VLM
16
202
0
28 Jun 2023
Towards Open Vocabulary Learning: A Survey
Towards Open Vocabulary Learning: A Survey
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD
VLM
34
136
0
28 Jun 2023
DesCo: Learning Object Recognition with Rich Language Descriptions
DesCo: Learning Object Recognition with Rich Language Descriptions
Liunian Harold Li
Zi-Yi Dou
Nanyun Peng
Kai-Wei Chang
ObjD
VLM
26
20
0
24 Jun 2023
One-shot Imitation Learning via Interaction Warping
One-shot Imitation Learning via Interaction Warping
Ondrej Biza
Skye Thompson
Kishore Reddy Pagidi
Abhinav Kumar
Elise van der Pol
Robin G. Walters
Thomas Kipf
Jan-Willem van de Meent
Lawson L. S. Wong
Robert W. Platt
30
13
0
21 Jun 2023
CLARA: Classifying and Disambiguating User Commands for Reliable
  Interactive Robotic Agents
CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents
Jeongeun Park
Seungwon Lim
Joonhyung Lee
Sangbeom Park
Minsuk Chang
Youngjae Yu
Sungjoon Choi
LM&Ro
34
22
0
17 Jun 2023
Scaling Open-Vocabulary Object Detection
Scaling Open-Vocabulary Object Detection
Matthias Minderer
A. Gritsenko
N. Houlsby
VLM
ObjD
24
178
0
16 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to
  Enhance Visio-Linguistic Compositional Understanding
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
31
9
0
15 Jun 2023
World-to-Words: Grounded Open Vocabulary Acquisition through Fast
  Mapping in Vision-Language Models
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
Ziqiao Ma
Jiayi Pan
J. Chai
ObjD
VLM
21
8
0
14 Jun 2023
GeneCIS: A Benchmark for General Conditional Image Similarity
GeneCIS: A Benchmark for General Conditional Image Similarity
S. Vaze
Nicolas Carion
Ishan Misra
VLM
DiffM
29
26
0
13 Jun 2023
Retrieval-Enhanced Contrastive Vision-Text Models
Retrieval-Enhanced Contrastive Vision-Text Models
Ahmet Iscen
Mathilde Caron
Alireza Fathi
Cordelia Schmid
CLIP
VLM
31
26
0
12 Jun 2023
LOWA: Localize Objects in the Wild with Attributes
LOWA: Localize Objects in the Wild with Attributes
Xiaoyuan Guo
Kezhen Chen
Jinmeng Rao
Yawen Zhang
Baochen Sun
Jie Yang
ObjD
40
2
0
31 May 2023
Multi-modal Queried Object Detection in the Wild
Multi-modal Queried Object Detection in the Wild
Yifan Xu
Mengdan Zhang
Chaoyou Fu
Peixian Chen
Xiaoshan Yang
Ke Li
Changsheng Xu
ObjD
VLM
30
30
0
30 May 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
56
187
0
29 May 2023
Z-GMOT: Zero-shot Generic Multiple Object Tracking
Z-GMOT: Zero-shot Generic Multiple Object Tracking
Kim Hoang Tran
Anh Duy Le Dinh
Tien-Phat Nguyen
Thinh Phan
Pha Nguyen
Khoa Luu
Don Adjeroh
Gianfranco Doretto
Ngan Hoang Le
VOT
33
5
0
28 May 2023
Modularized Zero-shot VQA with Pre-trained Models
Modularized Zero-shot VQA with Pre-trained Models
Rui Cao
Jing Jiang
LRM
27
2
0
27 May 2023
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
  Diffusion Models with Large Language Models
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
Long Lian
Boyi Li
Adam Yala
Trevor Darrell
43
152
0
23 May 2023
What Makes for Good Visual Tokenizers for Large Language Models?
What Makes for Good Visual Tokenizers for Large Language Models?
Guangzhi Wang
Yixiao Ge
Xiaohan Ding
Mohan S. Kankanhalli
Ying Shan
MLLM
VLM
27
38
0
20 May 2023
Semantic Anomaly Detection with Large Language Models
Semantic Anomaly Detection with Large Language Models
Amine Elhafsi
Rohan Sinha
Christopher Agia
Edward Schmerling
I. Nesnas
Marco Pavone
34
65
0
18 May 2023
Going Denser with Open-Vocabulary Part Segmentation
Going Denser with Open-Vocabulary Part Segmentation
Pei Sun
Shoufa Chen
Chenchen Zhu
Fanyi Xiao
Ping Luo
Saining Xie
Zhicheng Yan
ObjD
VLM
27
45
0
18 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with
  Vision Transformers
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
ViT
VLM
27
73
0
11 May 2023
Vision-Language Models in Remote Sensing: Current Progress and Future
  Trends
Vision-Language Models in Remote Sensing: Current Progress and Future Trends
Xiang Li
Congcong Wen
Yuan Hu
Zhenghang Yuan
Xiao Xiang Zhu
VLM
21
71
0
09 May 2023
TidyBot: Personalized Robot Assistance with Large Language Models
TidyBot: Personalized Robot Assistance with Large Language Models
Jimmy Wu
Rika Antonova
Adam Kan
Marion Lepert
Andy Zeng
Shuran Song
Jeannette Bohg
Szymon Rusinkiewicz
Thomas Funkhouser
LM&Ro
34
284
0
09 May 2023
ZeroSearch: Local Image Search from Text with Zero Shot Learning
ZeroSearch: Local Image Search from Text with Zero Shot Learning
Jatin Nainani
A. Mazumdar
Viraj Sheth
20
0
0
01 May 2023
A Cookbook of Self-Supervised Learning
A Cookbook of Self-Supervised Learning
Randall Balestriero
Mark Ibrahim
Vlad Sobal
Ari S. Morcos
Shashank Shekhar
...
Pierre Fernandez
Amir Bar
Hamed Pirsiavash
Yann LeCun
Micah Goldblum
SyDa
FedML
SSL
44
273
0
24 Apr 2023
End-to-End Spatio-Temporal Action Localisation with Video Transformers
End-to-End Spatio-Temporal Action Localisation with Video Transformers
A. Gritsenko
Xuehan Xiong
Josip Djolonga
Mostafa Dehghani
Chen Sun
Mario Lucic
Cordelia Schmid
Anurag Arnab
ViT
34
13
0
24 Apr 2023
Grounding Classical Task Planners via Vision-Language Models
Grounding Classical Task Planners via Vision-Language Models
Xiaohan Zhang
Yan Ding
S. Amiri
Hao Yang
Andy Kaminski
Chad Esselink
Shiqi Zhang
20
16
0
17 Apr 2023
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of
  Zoom and Spatial Biases in Image Classification
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification
Mohammad Reza Taesiri
Giang Nguyen
Sarra Habchi
C. Bezemer
Anh Totti Nguyen
VLM
34
20
0
11 Apr 2023
Mitigating Spurious Correlations in Multi-modal Models during
  Fine-tuning
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Yu Yang
Besmira Nushi
Hamid Palangi
Baharan Mirzasoleiman
39
36
0
08 Apr 2023
Training-Free Layout Control with Cross-Attention Guidance
Training-Free Layout Control with Cross-Attention Guidance
Minghao Chen
Iro Laina
Andrea Vedaldi
DiffM
135
222
0
06 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
36
19
0
05 Apr 2023
Learning to Name Classes for Vision and Language Models
Learning to Name Classes for Vision and Language Models
Sarah Parisot
Yongxin Yang
Steven G. McDonagh
VLM
17
10
0
04 Apr 2023
Associating Spatially-Consistent Grouping with Text-supervised Semantic
  Segmentation
Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation
Yabo Zhang
Zihao Wang
Jun Hao Liew
Jingjia Huang
Manyu Zhu
Jiashi Feng
W. Zuo
VLM
19
4
0
03 Apr 2023
Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
Yuheng Lu
Chenfeng Xu
Xi Wei
Xiaodong Xie
M. Tomizuka
Kurt Keutzer
Shanghang Zhang
3DPC
21
53
0
03 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
41
483
0
03 Apr 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
29
23
0
29 Mar 2023
What Can Human Sketches Do for Object Detection?
What Can Human Sketches Do for Object Detection?
Pinaki Nath Chowdhury
A. Bhunia
Aneeshan Sain
Subhadeep Koley
Tao Xiang
Yi-Zhe Song
ObjD
31
32
0
27 Mar 2023
Prompt-Guided Transformers for End-to-End Open-Vocabulary Object
  Detection
Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection
Hwanjun Song
Jihwan Bang
VLM
ObjD
29
14
0
25 Mar 2023
Previous
12345
Next