JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

AAAI Conference on Artificial Intelligence (AAAI), 2024

18 December 2024

Papers citing "JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts"

STMixer: A One-Stage Sparse Action Detector

Tao Wu

Mengqing Cao

Ziteng Gao

Gangshan Wu

Limin Wang

264

15 Apr 2024

Efficient Video Action Detection with Token Dropout and Context Refinement

IEEE International Conference on Computer Vision (ICCV), 2023

Lei Chen

Zhan Tong

Yibing Song

Gangshan Wu

Limin Wang

361

17 Apr 2023

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Computer Vision and Pattern Recognition (CVPR), 2023

Yi Wang

Yu Qiao

515

623

29 Mar 2023

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Computer Vision and Pattern Recognition (CVPR), 2022

Zuxuan Wu

Lu Yuan

435

127

08 Dec 2022

Holistic Interaction Transformer Network for Action Detection

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

Gueter Josmy Faure

Min-Hung Chen

S. Lai

358

23 Oct 2022

Expanding Language-Image Pretrained Models for General Video Recognition

European Conference on Computer Vision (ECCV), 2022

466

472

04 Aug 2022

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Neural Information Processing Systems (NeurIPS), 2022

873

1,844

23 Mar 2022

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

International Conference on Machine Learning (ICML), 2022

1.5K

6,390

28 Jan 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Computer Vision and Pattern Recognition (CVPR), 2022

Christoph Feichtenhofer

527

262

20 Jan 2022

Masked Autoencoders Are Scalable Vision Learners

Computer Vision and Pattern Recognition (CVPR), 2021

Piotr Dollár

2.8K

11,101

11 Nov 2021

Attention Bottlenecks for Multimodal Fusion

Neural Information Processing Systems (NeurIPS), 2021

710

754

30 Jun 2021

AST: Audio Spectrogram Transformer

Interspeech (Interspeech), 2021

768

1,256

05 Apr 2021

TubeR: Tubelet Transformer for Video Action Detection

Computer Vision and Pattern Recognition (CVPR), 2021

Hao Chen

...

489

02 Apr 2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Computer Vision and Pattern Recognition (CVPR), 2021

1.4K

1,443

17 Feb 2021

Is Space-Time Attention All You Need for Video Understanding?

International Conference on Machine Learning (ICML), 2021

Gedas Bertasius

Heng Wang

Lorenzo Torresani

1.5K

2,878

09 Feb 2021

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy

...

1.6K

60,663

22 Oct 2020

Pose And Joint-Aware Action Recognition

379

16 Oct 2020

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Computer Vision and Pattern Recognition (CVPR), 2020

Siyu Chen

395

174

14 Jun 2020

Asynchronous Interaction Aggregation for Action Detection

European Conference on Computer Vision (ECCV), 2020

337

133

16 Apr 2020

Audiovisual SlowFast Networks for Video Recognition

Christoph Feichtenhofer

766

241

23 Jan 2020

Listen to Look: Action Recognition by Previewing Audio

Computer Vision and Pattern Recognition (CVPR), 2019

485

291

10 Dec 2019

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

Okan Kopuklu

Xiangyu Wei

Gerhard Rigoll

549

164

15 Nov 2019

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

IEEE International Conference on Computer Vision (ICCV), 2019

Dima Damen

317

390

22 Aug 2019

A Short Note on the Kinetics-700 Human Action Dataset

335

540

15 Jul 2019

Dance with Flow: Two-in-One Stream Action Detection

Jiaojiao Zhao

Cees G. M. Snoek

563

01 Apr 2019

SlowFast Networks for Video Recognition

Christoph Feichtenhofer

Haoqi Fan

Jitendra Malik

Kaiming He

716

4,061

10 Dec 2018

Video Action Transformer Network

634

768

06 Dec 2018

Robust Deep Multi-modal Learning Based on Gated Information Fusion Network

Asian Conference on Computer Vision (ACCV), 2018

290

17 Jul 2018

Attention Is All You Need

Neural Information Processing Systems (NeurIPS), 2017

8.3K

172,602

12 Jun 2017

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

...

631

1,154

23 May 2017

The Kinetics Human Action Video Dataset

...

881

4,369

19 May 2017

Feature Pyramid Networks for Object Detection

Piotr Dollár

1.6K

26,332

09 Dec 2016

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

345

215

04 Aug 2016

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

...

Fei-Fei Li

3.5K

6,424

23 Feb 2016

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

3.6K

71,917

04 Jun 2015

Finding Action Tubes

Computer Vision and Pattern Recognition (CVPR), 2014

Georgia Gkioxari

Jitendra Malik

391

605

21 Nov 2014

ImageNet Large Scale Visual Recognition Challenge

International Journal of Computer Vision (IJCV), 2014

...

Li Fei-Fei

3.7K

42,317

01 Sep 2014

Microsoft COCO: Common Objects in Context

European Conference on Computer Vision (ECCV), 2014

Piotr Dollár

27.3K

51,723

01 May 2014

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

1.1K

7,015

03 Dec 2012