TubeDETR: Spatio-Temporal Video Grounding with Transformers
30 March 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
    ViT

Papers citing "TubeDETR: Spatio-Temporal Video Grounding with Transformers"

40 / 90 papers shown
Title
Dynamic Graph Attention for Referring Expression Comprehension
Sibei Yang
Guanbin Li
Yizhou Yu
OCL
66
218
0
18 Sep 2019
Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction
Jingwen Wang
Lin Ma
Wenhao Jiang
76
182
0
11 Sep 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM MLLM SSL
160
1,666
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Tan
Mohit Bansal
VLM MLLM
247
2,483
0
20 Aug 2019
Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
F. Saleh
Hongdong Li
Stephen Gould
66
147
0
20 Aug 2019
A Fast and Accurate One-Stage Approach to Visual Grounding
Zhengyuan Yang
Boqing Gong
Liwei Wang
Wenbing Huang
Dong Yu
Jiebo Luo
ObjD
56
362
0
18 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL VLM MLLM
207
905
0
16 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL VLM
231
3,684
0
06 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
665
24,464
0
26 Jul 2019
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Zhenfang Chen
Lin Ma
Wenhan Luo
Kwan-Yee K. Wong
95
103
0
06 Jun 2019
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Zhu Zhang
Zhijie Lin
Zhou Zhao
Zhenxin Xiao
51
213
0
06 Jun 2019
Weakly Supervised Video Moment Retrieval From Text Queries
Niluthpol Chowdhury Mithun
S. Paul
Amit K. Roy-Chowdhury
120
194
0
05 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM SSL
79
1,246
0
03 Apr 2019
Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing
Xihui Liu
Zihao Wang
Jing Shao
Xiaogang Wang
Hongsheng Li
ObjD
78
182
0
03 Mar 2019
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
Dongliang He
Xiang Zhao
Jizhou Huang
Fu Li
Xiao-Chang Liu
Shilei Wen
66
153
0
21 Jan 2019
Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks
Peng Wang
Qi Wu
Jiewei Cao
Chunhua Shen
Lianli Gao
Anton Van Den Hengel
ObjD
86
255
0
12 Dec 2018
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
166
3,274
0
10 Dec 2018
MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment
Da Zhang
Xiyang Dai
Xin Eric Wang
Yuan-fang Wang
L. Davis
68
305
0
30 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM SSL SSeg
1.8K
94,891
0
11 Oct 2018
Localizing Moments in Video with Temporal Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
82
159
0
05 Sep 2018
To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression
Yitian Yuan
Tao Mei
Wenwu Zhu
80
333
0
19 Apr 2018
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
92
529
0
03 Apr 2018
MAttNet: Modular Attention Network for Referring Expression Comprehension
Licheng Yu
Zhe Lin
Xiaohui Shen
Jimei Yang
Xin Lu
Mohit Bansal
Tamara L. Berg
ObjD
97
828
0
24 Jan 2018
Object Referring in Videos with Language and Human Gaze
A. Vasudevan
Dengxin Dai
Luc Van Gool
VOS
63
75
0
04 Jan 2018
Grounding Referring Expressions in Images by Variational Context
Hanwang Zhang
Yulei Niu
Shih-Fu Chang
BDL ObjD
61
220
0
05 Dec 2017
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Bohan Zhuang
Qi Wu
Chunhua Shen
Ian Reid
Anton Van Den Hengel
ObjD
57
134
0
17 Nov 2017
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
115
946
0
04 Aug 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
713
131,652
0
12 Jun 2017
TALL: Temporal Activity Localization via Language Query
J. Gao
Chen Sun
Zhenheng Yang
Ram Nevatia
123
820
0
05 May 2017
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
Fanyi Xiao
Leonid Sigal
Yong Jae Lee
63
139
0
03 May 2017
Spatio-temporal Person Retrieval via Natural Language Queries
Masataka Yamaguchi
Kuniaki Saito
Yoshitaka Ushiku
Tatsuya Harada
72
58
0
26 Apr 2017
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Licheng Yu
Hao Tan
Mohit Bansal
Tamara L. Berg
ObjD
94
275
0
30 Dec 2016
Modeling Relationships in Referential Expressions with Compositional Modular Networks
Ronghang Hu
Marcus Rohrbach
Jacob Andreas
Trevor Darrell
Kate Saenko
82
406
0
30 Nov 2016
Modeling Context Between Objects for Referring Expression Understanding
Varun K. Nagaraja
Vlad I. Morariu
Larry S. Davis
72
151
0
01 Aug 2016
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
413
10,494
0
21 Jul 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li-Jia Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
217
5,747
0
23 Feb 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
MedIm
2.2K
194,020
0
10 Dec 2015
Natural Language Object Retrieval
Ronghang Hu
Huazhe Xu
Marcus Rohrbach
Jiashi Feng
Kate Saenko
Trevor Darrell
ObjD
97
553
0
13 Nov 2015
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Julia Hockenmaier
Svetlana Lazebnik
199
2,060
0
19 May 2015
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen
Hao Fang
Tsung-Yi Lin
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
215
2,478
0
01 Apr 2015