Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.12370
Cited By
v1
v2 (latest)
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
22 March 2023
Sixun Dong
Huazhang Hu
Dongze Lian
Weixin Luo
Yichen Qian
Shenghua Gao
ViT
AI4TS
Re-assign community
ArXiv (abs)
PDF
HTML
Github (29★)
Papers citing
"Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos"
34 / 34 papers shown
Title
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
71
70
0
12 Oct 2022
Expanding Language-Image Pretrained Models for General Video Recognition
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM
CLIP
ViT
106
326
0
04 Aug 2022
LocVTP: Video-Text Pre-training for Temporal Localization
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
69
64
0
21 Jul 2022
Temporal Alignment Networks for Long-term Video
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
84
87
0
06 Apr 2022
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
Huazhang Hu
Sixun Dong
Yiqun Zhao
Dongze Lian
Zhengxin Li
Shenghua Gao
67
50
0
03 Apr 2022
CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection
Yanan Zhang
Jiaxin Chen
Di Huang
ViT
3DPC
94
58
0
01 Apr 2022
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
Minghao Chen
Fangyun Wei
Chong Li
Deng Cai
AI4TS
96
34
0
28 Mar 2022
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities
Fadime Sener
Dibyadip Chatterjee
Daniel Shelepov
Kun He
Dipika Singhania
Robert Y. Wang
Angela Yao
VGen
81
214
0
28 Mar 2022
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
79
87
0
26 Jan 2022
CLIP-Event: Connecting Text and Images with Event Structures
Manling Li
Ruochen Xu
Shuohang Wang
Luowei Zhou
Xudong Lin
Chenguang Zhu
Michael Zeng
Heng Ji
Shih-Fu Chang
VLM
CLIP
65
127
0
13 Jan 2022
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
465
7,757
0
11 Nov 2021
Cross-Modality Fusion Transformer for Multispectral Object Detection
Q. Fang
D. Han
Zhaokui Wang
ViT
73
149
0
30 Oct 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIP
VLM
311
578
0
28 Sep 2021
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
106
1,482
0
24 Jun 2021
Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering
Sateesh Kumar
S. Haresh
Awais Ahmed
Andrey Konin
M. Zia
Quoc-Huy Tran
SSL
66
48
0
27 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
51
132
0
20 May 2021
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
132
1,259
0
22 Apr 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
387
2,053
0
09 Feb 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
122
422
0
14 Nov 2020
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose
Yizhak Ben-Shabat
Xin Yu
F. Saleh
Dylan Campbell
Cristian Rodriguez-Opazo
Hongdong Li
Stephen Gould
69
114
0
01 Jul 2020
Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
Debidatta Dwibedi
Y. Aytar
Jonathan Tompson
P. Sermanet
Andrew Zisserman
AI4TS
59
113
0
27 Jun 2020
MPNet: Masked and Permuted Pre-training for Language Understanding
Kaitao Song
Xu Tan
Tao Qin
Jianfeng Lu
Tie-Yan Liu
102
1,121
0
20 Apr 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
126
712
0
13 Dec 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
110
1,200
0
07 Jun 2019
Unsupervised learning of action classes with continuous temporal embedding
Anna Kukleva
Hilde Kuehne
Fadime Sener
Juergen Gall
68
107
0
08 Apr 2019
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
230
996
0
01 Apr 2019
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
166
3,274
0
10 Dec 2018
Temporal Relational Reasoning in Videos
Bolei Zhou
A. Andonian
Aude Oliva
Antonio Torralba
NAI
96
1,039
0
22 Nov 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
235
8,019
0
22 May 2017
Temporal Segment Networks for Action Recognition in Videos
Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
ViT
114
810
0
08 May 2017
Temporal Convolutional Networks for Action Segmentation and Detection
Colin S. Lea
Michael D. Flynn
René Vidal
A. Reiter
Gregory Hager
95
1,492
0
16 Nov 2016
Categorical Reparameterization with Gumbel-Softmax
Eric Jang
S. Gu
Ben Poole
BDL
339
5,364
0
03 Nov 2016
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
ViT
105
3,835
0
02 Aug 2016
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
VLM
326
18,625
0
06 Feb 2015
1