MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

16 March 2025
Vrushank Ahire
Kunal Shah
Mudasir Nazir Khan
Nikhil Pakhale
Lownish Rai Sookha
M. A. Ganaie
Abhinav Dhall
Abstract

Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal independently, often overlooking the inherent correlation between the two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline's CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: this https URL
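The abstract names two concrete pieces worth unpacking: the CCC evaluation metric and the polar-coordinate (valence-arousal) output following Russell's circumplex model. The sketch below is a minimal illustration, not the authors' implementation: the CCC formula is standard, while the polar decoding (an intensity r and angle theta mapped to valence and arousal) and all function names are assumptions for illustration only.

```python
# Minimal sketch (not MAVEN's code): the concordance correlation
# coefficient (CCC) reported in the abstract, plus a hypothetical
# decoding of a polar-coordinate prediction into valence/arousal
# consistent with Russell's circumplex geometry.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D series.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    It is 1 only when predictions match targets in correlation,
    scale, and location, which is why it is stricter than Pearson r.
    """
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()  # population variance
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))

def polar_to_valence_arousal(r: np.ndarray, theta: np.ndarray):
    """Hypothetical polar decoding (an assumption, not confirmed by the
    paper's text): valence = r*cos(theta), arousal = r*sin(theta)."""
    return r * np.cos(theta), r * np.sin(theta)

# Example: a perfect prediction gives CCC = 1.0.
v = np.array([0.2, -0.1, 0.5, 0.0])
print(ccc(v, v))  # 1.0
```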

@article{ahire2025_2503.12623,
  title={MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network},
  author={Vrushank Ahire and Kunal Shah and Mudasir Nazir Khan and Nikhil Pakhale and Lownish Rai Sookha and M. A. Ganaie and Abhinav Dhall},
  journal={arXiv preprint arXiv:2503.12623},
  year={2025}
}