2109.08472
ActionCLIP: A New Paradigm for Video Action Recognition
17 September 2021
Mengmeng Wang
Jiazheng Xing
Yong Liu
VLM
Papers citing
"ActionCLIP: A New Paradigm for Video Action Recognition"
50 / 59 papers shown
Title
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Zichen Liu
Kunlun Xu
Fuchun Sun
Xu Zou
Yuxin Peng
Jiahuan Zhou
VLM
AI4TS
148
2
0
20 Mar 2025
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
Wanhua Li
Renping Zhou
Jiawei Zhou
Yingwei Song
Johannes Herter
Minghan Qin
Gao Huang
Hanspeter Pfister
3DGS
VLM
119
2
0
13 Mar 2025
From Text to Trajectory: Exploring Complex Constraint Representation and Decomposition in Safe Reinforcement Learning
Pusen Dong
Tianchen Zhu
Yue Qiu
Haoyi Zhou
Jianxin Li
142
1
0
24 Feb 2025
Conformal Predictions for Human Action Recognition with Vision-Language Models
Tim Bary
Clément Fuchs
Benoît Macq
VLM
117
0
0
10 Feb 2025
Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa
Kensho Hara
Hirokatsu Kataoka
Toru Tamaki
97
1
0
22 Jan 2025
Dynamic Scene Understanding from Vision-Language Representations
Shahaf Pruss
Morris Alper
Hadar Averbuch-Elor
OCL
447
0
0
20 Jan 2025
Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT
Wen-Dong Jiang
Chih-Yung Chang
Diptendu Sinha Roy
118
0
0
07 Jan 2025
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM
VGen
VLM
78
6
0
07 Nov 2024
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Tz-Ying Wu
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
EgoV
111
0
0
28 Jul 2024
MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation
Sanggeon Yun
Ryozo Masukawa
Minhyoung Na
Mohsen Imani
96
8
0
27 Jun 2024
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
75
0
0
06 Jun 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
79
2
0
22 Apr 2024
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
Pengfei Liu
Weizhe Yuan
Jinlan Fu
Zhengbao Jiang
Hiroaki Hayashi
Graham Neubig
VLM
SyDa
204
3,971
0
28 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven Hoi
FaML
194
1,960
0
16 Jul 2021
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
Han Fang
Pengfei Xiong
Luhui Xu
Yu Chen
CLIP
VLM
101
298
0
21 Jun 2021
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
132
1,259
0
22 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
222
2,150
0
29 Mar 2021
An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir
Asaf Noy
Lihi Zelnik-Manor
ViT
72
125
0
25 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
925
29,436
0
26 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
121
664
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
367
2,053
0
09 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
119
1,745
0
05 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
264
432
0
01 Feb 2021
TDN: Temporal Difference Networks for Efficient Action Recognition
Limin Wang
Zhan Tong
Bin Ji
Gangshan Wu
98
397
0
18 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
651
41,103
0
22 Oct 2020
Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition
Sudhakar Kumawat
Manisha Verma
Yuta Nakashima
Shanmuganathan Raman
169
43
0
22 Jul 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
124
374
0
29 Jun 2020
SmallBigNet: Integrating Core and Contextual Views for Video Classification
Xianhang Li
Yali Wang
Zhipeng Zhou
Yu Qiao
ViT
65
91
0
25 Jun 2020
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
128
1,019
0
09 Apr 2020
Temporal Pyramid Network for Action Recognition
Ceyuan Yang
Yinghao Xu
Jianping Shi
Bo Dai
Bolei Zhou
47
372
0
07 Apr 2020
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan-Ran Li
Bin Ji
Xintian Shi
Jianguo Zhang
Bin Kang
Limin Wang
ViT
84
447
0
03 Apr 2020
Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Biagio Brattoli
Joseph Tighe
Fedor Zhdanov
Pietro Perona
Krzysztof Chalupka
VLM
165
130
0
03 Mar 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
119
711
0
13 Dec 2019
TEINet: Towards an Efficient Architecture for Video Recognition
Zhaoyang Liu
Donghao Luo
Yabiao Wang
Limin Wang
Ying Tai
Chengjie Wang
Jilin Li
Feiyue Huang
Tong Lu
ViT
77
240
0
21 Nov 2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang
Mengmeng Wang
Weihao Gan
Wei Wu
Junjie Yan
79
382
0
07 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
226
3,684
0
06 Aug 2019
Video Modeling with Correlation Networks
Heng Wang
Du Tran
Lorenzo Torresani
Matt Feiszli
64
129
0
07 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
110
1,200
0
07 Jun 2019
Spatiotemporal Pyramid Network for Video Action Recognition
Yunbo Wang
Mingsheng Long
Jianmin Wang
Philip S. Yu
86
229
0
04 Mar 2019
D3D: Distilled 3D Networks for Video Action Recognition
Jonathan C. Stroud
David A. Ross
Chen Sun
Jia Deng
Rahul Sukthankar
3DPC
54
160
0
19 Dec 2018
Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu
Christoph Feichtenhofer
Haoqi Fan
Kaiming He
Philipp Krahenbuhl
Ross B. Girshick
169
480
0
12 Dec 2018
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
164
3,274
0
10 Dec 2018
Timeception for Complex Action Recognition
Noureldien Hussein
E. Gavves
A. Smeulders
101
214
0
04 Dec 2018
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
Chuang Gan
Song Han
98
1,691
0
20 Nov 2018
Cross-Modal and Hierarchical Modeling of Video and Text
Bowen Zhang
Hexiang Hu
Fei Sha
BDL
AI4TS
58
191
0
16 Oct 2018
Spatio-Temporal Channel Correlation Networks for Action Classification
Ali Diba
Mohsen Fayyaz
Vivek Sharma
M. M. Arzani
Rahman Yousefzadeh
Juergen Gall
Luc Van Gool
3DPC
65
181
0
19 Jun 2018
Videos as Space-Time Region Graphs
Xiaolong Wang
Abhinav Gupta
101
756
0
05 Jun 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
70
234
0
07 Apr 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
142
1,330
0
13 Dec 2017
A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran
Heng Wang
Lorenzo Torresani
Jamie Ray
Yann LeCun
Manohar Paluri
215
3,030
0
30 Nov 2017