Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2209.13306
Cited By
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
27 September 2022
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding"
44 / 44 papers shown
Title
AdaMixer: A Fast-Converging Query-Based Object Detector
Ziteng Gao
Limin Wang
Bing Han
Sheng Guo
ObjD
79
109
0
30 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
70
95
0
30 Mar 2022
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR
Shilong Liu
Feng Li
Hao Zhang
Xiaohu Yang
Xianbiao Qi
Hang Su
Jun Zhu
Lei Zhang
ViT
286
755
0
28 Jan 2022
Anchor DETR: Query Design for Transformer-Based Object Detection
Yingming Wang
Xinming Zhang
Tong Yang
Jian Sun
ViT
42
54
0
15 Sep 2021
Conditional DETR for Fast Training Convergence
Depu Meng
Xiaokang Chen
Zejia Fan
Gang Zeng
Houqiang Li
Yuhui Yuan
Lei-huan Sun
Jingdong Wang
ViT
73
619
0
13 Aug 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
167
881
0
26 Apr 2021
Toward Transformer-Based Object Detection
Josh Beal
Eric Kim
Eric Tzeng
Dong Huk Park
Andrew Zhai
Dmitry Kislyuk
ViT
86
215
0
17 Dec 2020
End-to-End Object Detection with Adaptive Clustering Transformer
Minghang Zheng
Peng Gao
Renrui Zhang
Kunchang Li
Xiaogang Wang
Hongsheng Li
Hao Dong
ViT
143
199
0
18 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
115
421
0
14 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
47
97
0
10 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
68
172
0
01 Nov 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
216
5,073
0
08 Oct 2020
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
Zhu Zhang
Zhou Zhao
Zhijie Lin
Baoxing Huai
Jing Yuan
61
35
0
16 Aug 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
523
608
0
21 Jul 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
379
13,025
0
26 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
106
503
0
01 May 2020
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Liujuan Cao
Chenglin Wu
Cheng Deng
Rongrong Ji
ObjD
236
291
0
19 Mar 2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
Zhu Zhang
Zhou Zhao
Yang Zhao
Qi. Wang
Huasheng Liu
Lianli Gao
64
115
0
19 Jan 2020
Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
Yitian Yuan
Lin Ma
Jingwen Wang
Wei Liu
Wenwu Zhu
74
244
0
31 Oct 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
153
1,663
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
237
2,479
0
20 Aug 2019
Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
F. Saleh
Hongdong Li
Stephen Gould
59
147
0
20 Aug 2019
A Fast and Accurate One-Stage Approach to Visual Grounding
Zhengyuan Yang
Boqing Gong
Liwei Wang
Wenbing Huang
Dong Yu
Jiebo Luo
ObjD
46
362
0
18 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
200
902
0
16 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
224
3,678
0
06 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
600
24,431
0
26 Jul 2019
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Zhenfang Chen
Lin Ma
Wenhan Luo
Kwan-Yee K. Wong
81
103
0
06 Jun 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
77
1,246
0
03 Apr 2019
Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing
Xihui Liu
Zihao Wang
Jing Shao
Xiaogang Wang
Hongsheng Li
ObjD
70
182
0
03 Mar 2019
Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression
S. Hamid Rezatofighi
Deyuan Li
JunYoung Gwak
Amir Sadeghian
Ian Reid
Silvio Savarese
147
4,163
0
25 Feb 2019
Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks
Peng Wang
Qi Wu
Jiewei Cao
Chunhua Shen
Lianli Gao
Anton Van Den Hengel
ObjD
78
255
0
12 Dec 2018
Learning to Assemble Neural Module Tree Networks for Visual Grounding
Daqing Liu
Hanwang Zhang
Feng Wu
Zhengjun Zha
49
270
0
08 Dec 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.7K
94,770
0
11 Oct 2018
MAttNet: Modular Attention Network for Referring Expression Comprehension
Licheng Yu
Zhe Lin
Xiaohui Shen
Jimei Yang
Xin Lu
Joey Tianyi Zhou
Tamara L. Berg
ObjD
97
827
0
24 Jan 2018
Object Referring in Videos with Language and Human Gaze
A. Vasudevan
Dengxin Dai
Luc Van Gool
VOS
57
75
0
04 Jan 2018
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
687
131,526
0
12 Jun 2017
TALL: Temporal Activity Localization via Language Query
J. Gao
Chen Sun
Zhenheng Yang
Ram Nevatia
123
819
0
05 May 2017
Spatio-temporal Person Retrieval via Natural Language Queries
Masataka Yamaguchi
Kuniaki Saito
Yoshitaka Ushiku
Tatsuya Harada
67
58
0
26 Apr 2017
Modeling Relationships in Referential Expressions with Compositional Modular Networks
Ronghang Hu
Marcus Rohrbach
Jacob Andreas
Trevor Darrell
Kate Saenko
75
406
0
30 Nov 2016
Modeling Context in Referring Expressions
Licheng Yu
Patrick Poirson
Shan Yang
Alexander C. Berg
Tamara L. Berg
127
1,263
0
31 Jul 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
2.2K
193,878
0
10 Dec 2015
Grounding of Textual Phrases in Images by Reconstruction
Anna Rohrbach
Marcus Rohrbach
Ronghang Hu
Trevor Darrell
Bernt Schiele
75
497
0
12 Nov 2015
You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon
S. Divvala
Ross B. Girshick
Ali Farhadi
ObjD
686
36,935
0
08 Jun 2015
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren
Kaiming He
Ross B. Girshick
Jian Sun
AIMat
ObjD
499
62,270
0
04 Jun 2015
1