
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Main: 7 pages
Bibliography: 3 pages
Appendix: 11 pages
5 figures
12 tables
Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but they face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances the dialogue capabilities of TA-LLMs through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state-transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized loss objective for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% on information gathering, 91% on tool-call rejection) with substantial improvements over the baseline (44% and 9.6%, respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
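The specialized dialogue-control objective is not reproduced on this page, but it builds on Direct Preference Optimization over the paired correct/incorrect trajectories described above. Below is a minimal sketch of the standard DPO loss applied to one chosen/rejected trajectory pair; the function name, the summed-log-probability inputs, and the beta value are illustrative assumptions, and the paper's actual objective presumably adds dialogue-specific terms beyond this.

import torch
import torch.nn.functional as F

def trajectory_dpo_loss(policy_chosen_logp: torch.Tensor,
                        policy_rejected_logp: torch.Tensor,
                        ref_chosen_logp: torch.Tensor,
                        ref_rejected_logp: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Each input is the summed log-probability of a full dialogue
    # trajectory (chosen = correct flow, rejected = incorrect flow)
    # under the trainable policy or the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between correct and incorrect dialogue flows;
    # beta controls deviation from the reference model (assumed value).
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

In a multi-turn setting, the summed log-probabilities would typically be computed only over assistant turns, with user turns and tool outputs masked out, so the preference signal targets the model's own dialogue decisions.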

@article{jung2025_2504.02882,
  title={DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models},
  author={Sunghee Jung and Donghun Lee and Shinbok Lee and Gaeun Seo and Daniel Lee and Byeongil Ko and Junrae Cho and Kihyun Kim and Eunggyun Kim and Myeongcheol Shin},
  journal={arXiv preprint arXiv:2504.02882},
  year={2025}
}
