MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

19 May 2025
Ziyang Ma
Yinghao Ma
Yanqiao Zhu
Chen Yang
Yi-Wen Chao
Ruiyang Xu
Wenxi Chen
Yuanzhe Chen
Zhuo Chen
Jian Cong
Kai Li
Keliang Li
Siyou Li
Xinfeng Li
Xiquan Li
Zheng Lian
Yuzhe Liang
Minghao Liu
Zhikang Niu
Tianrui Wang
Yuping Wang
Yuxuan Wang
Yihao Wu
Guanrou Yang
Jianwei Yu
Ruibin Yuan
Zhisheng Zheng
Ziya Zhou
Haina Zhu
Wei Xue
Emmanouil Benetos
Kai Yu
Eng Siong Chng
Xie Chen
Communities: AuLLM · LRM
Abstract

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across a broad set of multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error correction and quality checks. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR spans a broad range of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in audio reasoning, we annotate every question with a Chain-of-Thought (CoT) rationale. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding, and some questions require graduate-level perceptual and domain-specific knowledge, raising the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), and text-only Large Language Models (LLMs) and Large Reasoning Models (LRMs) given audio captions as input. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis reveals critical limitations in the understanding and reasoning capabilities of current models. We hope MMAR will serve as a catalyst for future advances in this important but underexplored area.
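
To make the benchmark's structure concrete, below is a minimal Python sketch of how one MMAR item and a simple exact-match scoring loop might look, based only on the abstract above. The field names, the sub_category field, and the predict(audio_path, question) interface are illustrative assumptions, not the official MMAR schema or evaluation protocol.

# Hypothetical sketch of an MMAR item and an exact-match scoring loop.
# All names here are assumptions drawn from the abstract, not the
# official MMAR data format.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable

class ReasoningLayer(Enum):
    SIGNAL = "Signal"
    PERCEPTION = "Perception"
    SEMANTIC = "Semantic"
    CULTURAL = "Cultural"

@dataclass
class MMARItem:
    audio_path: str          # clip sourced from a real-world internet video
    question: str
    answer: str              # reference answer
    layer: ReasoningLayer    # one of the four reasoning layers
    sub_category: str        # finer-grained task label within the layer
    cot_rationale: str       # annotated Chain-of-Thought rationale

def exact_match_accuracy(predict: Callable[[str, str], str],
                         items: Iterable[MMARItem]) -> float:
    # predict(audio_path, question) stands in for any audio-capable model;
    # per the abstract, text-only LLMs/LRMs would instead receive an audio
    # caption in place of the raw audio.
    items = list(items)
    if not items:
        return 0.0
    correct = sum(
        predict(it.audio_path, it.question).strip().lower()
        == it.answer.strip().lower()
        for it in items
    )
    return correct / len(items)

Keeping the layer field on each item makes it straightforward to break accuracy down per reasoning layer, mirroring the hierarchical categorization the abstract describes.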

View on arXiv: https://arxiv.org/abs/2505.13032
@article{ma2025_2505.13032,
  title={MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix},
  author={Ziyang Ma and Yinghao Ma and Yanqiao Zhu and Chen Yang and Yi-Wen Chao and Ruiyang Xu and Wenxi Chen and Yuanzhe Chen and Zhuo Chen and Jian Cong and Kai Li and Keliang Li and Siyou Li and Xinfeng Li and Xiquan Li and Zheng Lian and Yuzhe Liang and Minghao Liu and Zhikang Niu and Tianrui Wang and Yuping Wang and Yuxuan Wang and Yihao Wu and Guanrou Yang and Jianwei Yu and Ruibin Yuan and Zhisheng Zheng and Ziya Zhou and Haina Zhu and Wei Xue and Emmanouil Benetos and Kai Yu and Eng-Siong Chng and Xie Chen},
  journal={arXiv preprint arXiv:2505.13032},
  year={2025}
}