2412.13708
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
AAAI Conference on Artificial Intelligence (AAAI), 2024
18 December 2024
Taein Son
Soo Won Seo
Jisong Kim
S. Lee
Jun Won Choi
VGen
Papers citing
"JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts"
39 / 39 papers shown
STMixer: A One-Stage Sparse Action Detector
Tao Wu
Mengqing Cao
Ziteng Gao
Gangshan Wu
Limin Wang
264
39
0
15 Apr 2024
Efficient Video Action Detection with Token Dropout and Context Refinement
IEEE International Conference on Computer Vision (ICCV), 2023
Lei Chen
Zhan Tong
Yibing Song
Gangshan Wu
Limin Wang
361
31
0
17 Apr 2023
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Computer Vision and Pattern Recognition (CVPR), 2023
Limin Wang
Bingkun Huang
Zhiyu Zhao
Zhan Tong
Yinan He
Yi Wang
Yali Wang
Yu Qiao
VGen
515
623
0
29 Mar 2023
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Lu Yuan
Yu-Gang Jiang
VGen
435
127
0
08 Dec 2022
Holistic Interaction Transformer Network for Action Detection
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Gueter Josmy Faure
Min-Hung Chen
S. Lai
358
50
0
23 Oct 2022
Expanding Language-Image Pretrained Models for General Video Recognition
European Conference on Computer Vision (ECCV), 2022
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM
CLIP
ViT
466
472
0
04 Aug 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Neural Information Processing Systems (NeurIPS), 2022
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
873
1,844
0
23 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022
Junnan Li
Dongxu Li
Caiming Xiong
Steven C. H. Hoi
MLLM
BDL
VLM
CLIP
1.5K
6,390
0
28 Jan 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chao-Yuan Wu
Yanghao Li
K. Mangalam
Haoqi Fan
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
527
262
0
20 Jan 2022
Masked Autoencoders Are Scalable Vision Learners
Computer Vision and Pattern Recognition (CVPR), 2021
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
2.8K
11,101
0
11 Nov 2021
Attention Bottlenecks for Multimodal Fusion
Neural Information Processing Systems (NeurIPS), 2021
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
710
754
0
30 Jun 2021
AST: Audio Spectrogram Transformer
Interspeech (Interspeech), 2021
Yuan Gong
Yu-An Chung
James R. Glass
ViT
768
1,256
0
05 Apr 2021
TubeR: Tubelet Transformer for Video Action Detection
Computer Vision and Pattern Recognition (CVPR), 2021
Jiaojiao Zhao
Yanyi Zhang
Xinyu Li
Hao Chen
Shuai Bing
...
Yuanjun Xiong
Davide Modolo
I. Marsic
Cees G. M. Snoek
Joseph Tighe
ViT
489
98
0
02 Apr 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.4K
1,443
0
17 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
International Conference on Machine Learning (ICML), 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
1.5K
2,878
0
09 Feb 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
1.6K
60,663
0
22 Oct 2020
Pose And Joint-Aware Action Recognition
Anshul B. Shah
Shlok Kumar Mishra
Ankan Bansal
Jun-Cheng Chen
Ramalingam Chellappa
Abhinav Shrivastava
379
42
0
16 Oct 2020
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
Computer Vision and Pattern Recognition (CVPR), 2020
Junting Pan
Siyu Chen
Zheng Shou
Yu Liu
Jing Shao
Hongsheng Li
3DPC
395
174
0
14 Jun 2020
Asynchronous Interaction Aggregation for Action Detection
European Conference on Computer Vision (ECCV), 2020
Jiajun Tang
Jinchao Xia
Xinzhi Mu
Bo Pang
Cewu Lu
337
133
0
16 Apr 2020
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
766
241
0
23 Jan 2020
Listen to Look: Action Recognition by Previewing Audio
Computer Vision and Pattern Recognition (CVPR), 2019
Ruohan Gao
Tae-Hyun Oh
Kristen Grauman
Lorenzo Torresani
VLM
485
291
0
10 Dec 2019
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Okan Kopuklu
Xiangyu Wei
Gerhard Rigoll
549
164
0
15 Nov 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
IEEE International Conference on Computer Vision (ICCV), 2019
Evangelos Kazakos
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
317
390
0
22 Aug 2019
A Short Note on the Kinetics-700 Human Action Dataset
João Carreira
Eric Noland
Chloe Hillier
Andrew Zisserman
335
540
0
15 Jul 2019
Dance with Flow: Two-in-One Stream Action Detection
Jiaojiao Zhao
Cees G. M. Snoek
563
91
0
01 Apr 2019
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
716
4,061
0
10 Dec 2018
Video Action Transformer Network
Rohit Girdhar
João Carreira
Carl Doersch
Andrew Zisserman
ViT
634
768
0
06 Dec 2018
Robust Deep Multi-modal Learning Based on Gated Information Fusion Network
Asian Conference on Computer Vision (ACCV), 2018
Jaekyum Kim
Junho Koh
Yecheol Kim
Jaehyung Choi
Youngbae Hwang
Jun-Won Choi
290
78
0
17 Jul 2018
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
8.3K
172,602
0
12 Jun 2017
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu
Chen Sun
David A. Ross
Carl Vondrick
C. Pantofaru
...
G. Toderici
Susanna Ricco
Rahul Sukthankar
Cordelia Schmid
Jitendra Malik
VGen
631
1,154
0
23 May 2017
The Kinetics Human Action Video Dataset
W. Kay
João Carreira
Karen Simonyan
Brian Zhang
Chloe Hillier
...
Tim Green
T. Back
Apostol Natsev
Mustafa Suleyman
Andrew Zisserman
881
4,369
0
19 May 2017
Feature Pyramid Networks for Object Detection
Tsung-Yi Lin
Piotr Dollár
Ross B. Girshick
Kaiming He
Bharath Hariharan
Serge J. Belongie
ObjD
1.6K
26,332
0
09 Dec 2016
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
Suman Saha
Gurkirt Singh
Michael Sapienza
Juil Sock
Fabio Cuzzolin
ViT
345
215
0
04 Aug 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
3.5K
6,424
0
23 Feb 2016
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015
Shaoqing Ren
Kaiming He
Ross B. Girshick
Jian Sun
AIMat
ObjD
3.6K
71,917
0
04 Jun 2015
Finding Action Tubes
Computer Vision and Pattern Recognition (CVPR), 2014
Georgia Gkioxari
Jitendra Malik
391
605
0
21 Nov 2014
ImageNet Large Scale Visual Recognition Challenge
International Journal of Computer Vision (IJCV), 2014
Olga Russakovsky
Jia Deng
Hao Su
J. Krause
S. Satheesh
...
A. Karpathy
A. Khosla
Michael S. Bernstein
Alexander C. Berg
Li Fei-Fei
VLM
ObjD
3.7K
42,317
0
01 Sep 2014
Microsoft COCO: Common Objects in Context
European Conference on Computer Vision (ECCV), 2014
Tsung-Yi Lin
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
27.3K
51,723
0
01 May 2014
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro
Amir Zamir
M. Shah
CLIP
VGen
1.1K
7,015
0
03 Dec 2012