
Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

Main: 8 pages
16 figures
Bibliography: 4 pages
9 tables
Appendix: 8 pages
Abstract

Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherently stateful nature of interactions in multi-turn applications. To fill this gap, we propose \texttt{DialogTool}, a multi-turn dialogue dataset with stateful tool interactions that covers the whole life cycle of tool use, across six key tasks in three stages: 1) \textit{tool creation}; 2) \textit{tool utilization}: tool awareness, tool selection, and tool execution; and 3) \textit{role-consistent response}: response generation and role play. Furthermore, we build \texttt{VirtualMobile} -- an embodied virtual mobile evaluation environment that simulates API calls and assesses the robustness of the created APIs\footnote{We use the terms tool and API interchangeably; there is no significant difference between them in this paper.}. Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still cannot use tools effectively over long horizons.
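To make the notion of "stateful" tool use concrete, below is a minimal, hypothetical Python sketch of a multi-turn tool-call loop against a simulated mobile environment, in the spirit of what the abstract describes. All names here (ToolCall, SimulatedPhone, the example APIs create_alarm and lookup_contact, and the dialogue trace) are illustrative assumptions, not the actual DialogTool schema or VirtualMobile API surface.

```python
# Hypothetical sketch: a stateful multi-turn tool-use episode. API calls
# mutate persistent environment state, so later turns depend on earlier ones.
# Names and APIs are assumptions for illustration, not the paper's schema.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str   # which API the agent selected this turn
    args: dict  # arguments for this execution


class SimulatedPhone:
    """Toy stand-in for an embodied virtual mobile environment."""

    def __init__(self):
        # Persistent state shared across dialogue turns.
        self.state = {"alarms": [], "contacts": {"alice": "555-0100"}}

    def call(self, tc: ToolCall) -> dict:
        if tc.name == "create_alarm":        # state-mutating tool
            self.state["alarms"].append(tc.args["time"])
            return {"ok": True, "alarms": list(self.state["alarms"])}
        if tc.name == "lookup_contact":      # read-only tool
            return {"number": self.state["contacts"].get(tc.args["name"])}
        return {"ok": False, "error": f"unknown tool {tc.name}"}


# A two-turn trace: the second call only makes sense given the first one,
# which is what a single-turn, stateless benchmark cannot capture.
phone = SimulatedPhone()
trace = [
    ToolCall("create_alarm", {"time": "07:00"}),
    ToolCall("create_alarm", {"time": "07:30"}),
]
for turn, tc in enumerate(trace, 1):
    print(f"turn {turn}: {tc.name} -> {phone.call(tc)}")

# Stateful evaluation checks the final environment state, not just whether
# each individual call was well-formed.
assert phone.state["alarms"] == ["07:00", "07:30"]
```

The design point this sketch illustrates is that evaluation must inspect the environment's accumulated state after the full dialogue, rather than scoring each tool call in isolation.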

@article{wang2025_2505.13328,
  title={Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges},
  author={Hongru Wang and Wenyu Huang and Yufei Wang and Yuanhao Xi and Jianqiao Lu and Huan Zhang and Nan Hu and Zeming Liu and Jeff Z. Pan and Kam-Fai Wong},
  journal={arXiv preprint arXiv:2505.13328},
  year={2025}
}