JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

18 December 2024
Taein Son
Soo Won Seo
Jisong Kim
Seok Hwan Lee
Jun Won Choi
Abstract

Video Action Detection (VAD) entails localizing and categorizing action instances in videos, which inherently contain diverse information sources such as audio, visual cues, and the surrounding scene context. Exploiting this multi-modal information effectively for VAD is challenging, because the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene-descriptive context sourced from large-capacity image-captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene-descriptive information, which enables adaptive integration of the cues most relevant to recognizing each actor's actions. We develop a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks, AVA, UCF101-24, and JHMDB51-21, demonstrates that incorporating multi-modal information substantially enhances performance and sets a new state of the art in the field.
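
The actor-centric fusion described above can be pictured with a minimal sketch: actor embeddings act as queries that cross-attend to visual, audio, and caption (language) tokens, and a learned gate weights the three modality summaries per actor before action classification. This is an illustrative PyTorch sketch under assumed module names, dimensions, and gating scheme, not the authors' implementation.

# Illustrative sketch of actor-centric multi-modal fusion (assumptions, not JoVALE's code):
# actor queries cross-attend to visual, audio, and caption tokens, a per-actor gate
# mixes the three modality summaries, and a linear head predicts action logits.
import torch
import torch.nn as nn


class ActorCentricFusionSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_classes=80):
        super().__init__()
        # One cross-attention block per modality; actor embeddings are the queries.
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gated fusion: learn per-actor weights over the three modality summaries.
        self.gate = nn.Linear(3 * d_model, 3)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, actor_emb, visual_feat, audio_feat, text_feat):
        # actor_emb:   (B, N_actors, D)  e.g. pooled actor-proposal features
        # visual_feat: (B, T_v, D)       video backbone tokens
        # audio_feat:  (B, T_a, D)       audio tokens
        # text_feat:   (B, T_t, D)       scene-description (caption) tokens
        v, _ = self.visual_attn(actor_emb, visual_feat, visual_feat)
        a, _ = self.audio_attn(actor_emb, audio_feat, audio_feat)
        t, _ = self.text_attn(actor_emb, text_feat, text_feat)
        # Per-actor softmax weights decide how much each modality contributes.
        w = torch.softmax(self.gate(torch.cat([v, a, t], dim=-1)), dim=-1)
        fused = w[..., 0:1] * v + w[..., 1:2] * a + w[..., 2:3] * t
        return self.classifier(fused)  # (B, N_actors, n_classes) action logits


if __name__ == "__main__":
    model = ActorCentricFusionSketch()
    logits = model(
        torch.randn(2, 5, 256),   # 5 actor proposals per clip
        torch.randn(2, 64, 256),  # visual tokens
        torch.randn(2, 32, 256),  # audio tokens
        torch.randn(2, 16, 256),  # caption tokens
    )
    print(logits.shape)  # torch.Size([2, 5, 80])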

View on arXiv
@article{son2025_2412.13708,
  title={JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts},
  author={Taein Son and Soo Won Seo and Jisong Kim and Seok Hwan Lee and Jun Won Choi},
  journal={arXiv preprint arXiv:2412.13708},
  year={2025}
}