MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

27 May 2025
Chenghao Yang
Yinbo Luo
Zhoufutu Wen
Qi Chu
Tao Gong
L. J. Liu
Kaiyuan Zhang
Jianpeng Jiao
Ge Zhang
Wenhao Huang
Nenghai Yu
    LLMAG, LRM
Main: 12 pages
10 figures
Bibliography: 1 page
6 tables
Appendix: 16 pages
Abstract

Large Language Models (LLMs), such as ChatGPT, have been widely adopted in real-world dialogue applications. However, their robustness in long, complex dialogue sessions, which involve frequent motivation transfer and sophisticated cross-turn dependencies, has long been criticized, and no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark designed to close this gap. MARS-Bench is constructed from play-by-play text commentary so that it features realistic dialogues, and it is designed to evaluate three critical aspects of multi-turn conversation: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly improves robustness in long, complex dialogue sessions, and that LLMs do face significant challenges with motivation transfer and sophisticated cross-turn dependencies. We also offer a mechanistic interpretation, based on an attention visualization experiment on Qwen2.5-7B-Instruct, of how attention sinks caused by special tokens degrade performance in long, complex dialogue sessions.

@article{yang2025_2505.23810,
  title={MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation},
  author={Chenghao Yang and Yinbo Luo and Zhoufutu Wen and Qi Chu and Tao Gong and Longxiang Liu and Kaiyuan Zhang and Jianpeng Jiao and Ge Zhang and Wenhao Huang and Nenghai Yu},
  journal={arXiv preprint arXiv:2505.23810},
  year={2025}
}