A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models

27 January 2026

Iwona Christop

Mateusz Czyżnikiewicz

Paweł Skórzewski

Łukasz Bondaruk

Jakub Kubiak

Marcin Lewandowski

Marek Kubis

AuLLM

Main: 8 Pages

3 Figures

Bibliography: 3 Pages

16 Tables

Appendix: 20 Pages

Abstract

Existing benchmarks for the audio modality of multimodal large language models test individual audio tasks, such as speaker diarization or gender identification, in isolation. They therefore cannot verify whether a multimodal model can answer questions that require reasoning to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over the audio signal.
