2203.16434
TubeDETR: Spatio-Temporal Video Grounding with Transformers
30 March 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
Papers citing
"TubeDETR: Spatio-Temporal Video Grounding with Transformers"
50 / 90 papers shown
Title
Action tube generation by person query matching for spatio-temporal action detection
Kazuki Omi
Jion Oshima
Toru Tamaki
130
0
0
17 Mar 2025
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
104
8
0
21 Mar 2024
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
77
230
0
25 Oct 2021
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Jie Lei
Tamara L. Berg
Mohit Bansal
ViT
68
64
0
20 Jul 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
86
279
0
09 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM
LRM
102
381
0
04 Jun 2021
End-to-End Video Object Detection with Spatial-Temporal Transformers
Lu He
Qianyu Zhou
Xiangtai Li
Li Niu
Guangliang Cheng
Xiao Li
Wenxuan Liu
Yunhai Tong
Lizhuang Ma
Liqing Zhang
ViT
110
97
0
23 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
172
883
0
26 Apr 2021
VidTr: Video Transformer Without Convolutions
Yanyi Zhang
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Biagio Brattoli
Hao Chen
I. Marsic
Joseph Tighe
ViT
219
196
0
23 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
316
589
0
22 Apr 2021
TransVG: End-to-End Visual Grounding with Transformers
Jiajun Deng
Zhengyuan Yang
Tianlang Chen
Wen-gang Zhou
Houqiang Li
ViT
74
342
0
17 Apr 2021
Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding
Binbin Huang
Dongze Lian
Weixin Luo
Shenghua Gao
ObjD
70
94
0
09 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
149
1,176
0
01 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
222
2,150
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
453
21,439
0
25 Mar 2021
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
Yongfei Liu
Bo Wan
Lin Ma
Xuming He
ObjD
82
56
0
24 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
931
29,436
0
26 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
124
664
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
389
2,053
0
09 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
126
1,749
0
05 Feb 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
122
422
0
14 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
61
98
0
10 Nov 2020
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Mostafa Dehghani
Samira Abnar
Yikang Shen
Dara Bahri
Philip Pham
J. Rao
Liu Yang
Sebastian Ruder
Donald Metzler
147
720
0
08 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
73
172
0
01 Nov 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
657
41,103
0
22 Oct 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
224
5,080
0
08 Oct 2020
Rethinking Attention with Performers
K. Choromanski
Valerii Likhosherstov
David Dohan
Xingyou Song
Andreea Gane
...
Afroz Mohiuddin
Lukasz Kaiser
David Belanger
Lucy J. Colwell
Adrian Weller
184
1,597
0
30 Sep 2020
Big Bird: Transformers for Longer Sequences
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
VLM
546
2,086
0
28 Jul 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
535
610
0
21 Jul 2020
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
Liwei Wang
Jing-ling Huang
Yin Li
Kun Xu
Zhengyuan Yang
Dong Yu
ObjD
62
83
0
03 Jul 2020
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai
Justin Johnson
SSL
VLM
151
435
0
11 Jun 2020
Linformer: Self-Attention with Linear Complexity
Sinong Wang
Belinda Z. Li
Madian Khabsa
Han Fang
Hao Ma
216
1,706
0
08 Jun 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
421
13,048
0
26 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
118
503
0
01 May 2020
Span-based Localizing Network for Natural Language Video Localization
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
87
315
0
29 Apr 2020
Lite Transformer with Long-Short Range Attention
Zhanghao Wu
Zhijian Liu
Ji Lin
Chengyue Wu
Song Han
60
322
0
24 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
108
1,941
0
13 Apr 2020
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALM
VLM
174
4,071
0
10 Apr 2020
Dense Regression Network for Video Grounding
Runhao Zeng
Haoming Xu
Wenbing Huang
Peihao Chen
Mingkui Tan
Chuang Gan
79
283
0
07 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
143
440
0
02 Apr 2020
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Liujuan Cao
Chenglin Wu
Cheng Deng
Rongrong Ji
ObjD
253
291
0
19 Mar 2020
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
254
209
0
23 Jan 2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
Zhu Zhang
Zhou Zhao
Yang Zhao
Qi Wang
Huasheng Liu
Lianli Gao
67
118
0
19 Jan 2020
Reformer: The Efficient Transformer
Nikita Kitaev
Lukasz Kaiser
Anselm Levskaya
VLM
195
2,320
0
13 Jan 2020
Meshed-Memory Transformer for Image Captioning
Marcella Cornia
Matteo Stefanini
Lorenzo Baraldi
Rita Cucchiara
75
882
0
17 Dec 2019
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
126
712
0
13 Dec 2019
Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
Songyang Zhang
Houwen Peng
Jianlong Fu
Jiebo Luo
75
470
0
08 Dec 2019
12-in-1: Multi-Task Vision and Language Representation Learning
Jiasen Lu
Vedanuj Goswami
Marcus Rohrbach
Devi Parikh
Stefan Lee
VLM
ObjD
95
481
0
05 Dec 2019
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
Zhijie Lin
Zhou Zhao
Zhu Zhang
Qi Wang
Huasheng Liu
64
149
0
19 Nov 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
352
941
0
24 Sep 2019