Cocktail-Party Audio-Visual Speech Recognition

2 June 2025
Thai-Binh Nguyen
Ngoc-Quan Pham
Alexander Waibel
Main: 4 pages · Bibliography: 1 page · 2 figures · 3 tables
Abstract

Audio-Visual Speech Recognition (AVSR) offers a robust solution for speech recognition in challenging environments, such as cocktail-party scenarios, where relying on audio alone proves insufficient. However, current AVSR models are often optimized for idealized settings with consistently active speakers, overlooking the complexity of real-world conditions that mix speaking and silent facial segments. This study addresses that gap by introducing a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems and to expose the limitations of prior approaches under realistic noisy conditions. Additionally, we contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state of the art, from 119% to 39.2% in extreme noise, without relying on explicit segmentation cues.
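The headline numbers can be checked directly: a drop from 119% WER (WER can exceed 100% because insertion errors count against the reference length) to 39.2% WER is a roughly 67% relative reduction. A minimal sketch of that arithmetic (the function name is ours, not from the paper):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in word error rate, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Figures from the abstract: 119% baseline WER vs. 39.2% WER in extreme noise.
reduction = relative_wer_reduction(119.0, 39.2)
print(f"{reduction:.1%}")  # → 67.1%
```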

View on arXiv
@article{nguyen2025_2506.02178,
  title={Cocktail-Party Audio-Visual Speech Recognition},
  author={Thai-Binh Nguyen and Ngoc-Quan Pham and Alexander Waibel},
  journal={arXiv preprint arXiv:2506.02178},
  year={2025}
}