Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2305.14173
Cited By
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
23 May 2023
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
Re-assign community
ArXiv
PDF
HTML
Papers citing
"TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale"
33 / 33 papers shown
Title
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
112
0
0
02 Apr 2025
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
155
3
0
16 Dec 2024
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
93
166
0
28 Mar 2023
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Limin Wang
Yu Qiao
ViT
79
111
0
17 Nov 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
73
151
0
15 Sep 2022
Expanding Language-Image Pretrained Models for General Video Recognition
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM
CLIP
ViT
92
324
0
04 Aug 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Hongsheng Li
74
202
0
27 Jun 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
64
34
0
10 May 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
40
43
0
26 Apr 2022
Visual Prompt Tuning
Menglin Jia
Luming Tang
Bor-Chun Chen
Claire Cardie
Serge Belongie
Bharath Hariharan
Ser-Nam Lim
VLM
VPVLM
136
1,615
0
23 Mar 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
57
109
0
13 Jan 2022
Prompting Visual-Language Models for Efficient Video Understanding
Chen Ju
Tengda Han
Kunhao Zheng
Ya Zhang
Weidi Xie
VPVLM
VLM
71
375
0
08 Dec 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
85
220
0
24 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
66
193
0
19 Nov 2021
iBOT: Image BERT Pre-Training with Online Tokenizer
Jinghao Zhou
Chen Wei
Huiyu Wang
Wei Shen
Cihang Xie
Alan Yuille
Tao Kong
70
729
0
15 Nov 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIP
VLM
309
576
0
28 Sep 2021
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Xingyi Cheng
Hezheng Lin
Xiangyu Wu
Fan Yang
Dong Shen
51
154
0
09 Sep 2021
Elaborative Rehearsal for Zero-shot Action Recognition
Shizhe Chen
Dong Huang
VLM
65
96
0
05 Aug 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
133
1,172
0
01 Apr 2021
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
Tian Pan
Yibing Song
Tianyu Yang
Wenhao Jiang
Wei Liu
70
225
0
10 Mar 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
114
661
0
11 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
419
3,826
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
362
2,039
0
09 Feb 2021
Self-supervised Co-training for Video Representation Learning
Tengda Han
Weidi Xie
Andrew Zisserman
SSL
238
317
0
19 Oct 2020
Self-supervised Video Representation Learning by Pace Prediction
Jiangliu Wang
Jianbo Jiao
Yunhui Liu
SSL
AI4TS
69
235
0
13 Aug 2020
Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Biagio Brattoli
Joseph Tighe
Fedor Zhdanov
Pietro Perona
Krzysztof Chalupka
VLM
158
130
0
03 Mar 2020
Video Representation Learning by Dense Predictive Coding
Tengda Han
Weidi Xie
Andrew Zisserman
SSL
86
361
0
10 Sep 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
105
1,199
0
07 Jun 2019
Towards Universal Representation for Unseen Action Recognition
Yi Zhu
Yang Long
Yu Guan
Shawn D. Newsam
Ling Shao
AI4TS
80
104
0
22 Mar 2018
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
107
946
0
04 Aug 2017
Alternative Semantic Representations for Zero-Shot Human Action Recognition
Qian Wang
Ke Chen
VLM
52
73
0
28 Jun 2017
A Dataset for Movie Description
Anna Rohrbach
Marcus Rohrbach
Niket Tandon
Bernt Schiele
VGen
100
499
0
12 Jan 2015
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro
Amir Zamir
M. Shah
CLIP
VGen
117
6,134
0
03 Dec 2012
1