
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Siyuan Xu
Shiyang Li
Xin Liu
Tianyi Liu
Yixiao Li
Zhan Shi
Zixuan Zhang
Zilong Wang
Qingyu Yin
Jianshu Chen
Tuo Zhao
Bing Yin
Main: 8 pages · Bibliography: 4 pages · Appendix: 8 pages · 10 figures · 10 tables
Abstract

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-14B-Instruct, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
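As a concrete illustration of the reward mechanism the abstract describes, the Python sketch below shows how reference matching against preserved oracle tool calls might be implemented, with a lightweight judge fallback for special behaviors such as error detection. All names here (normalize_call, reference_match_reward, judge_reward) are illustrative assumptions, not identifiers from the paper, and the matching policy (strict set equality over canonicalized calls, whitespace-normalized answers) is a guess at one reasonable choice rather than the paper's exact rule.

```python
import json
from typing import Any, Callable


def normalize_call(call: dict[str, Any]) -> tuple[str, str]:
    # Canonicalize a tool call as (name, sorted-JSON arguments) so that
    # argument ordering and formatting do not affect matching.
    return call["name"], json.dumps(call.get("arguments", {}), sort_keys=True)


def reference_match_reward(rollout_calls: list[dict[str, Any]],
                           oracle_calls: list[dict[str, Any]],
                           rollout_answer: str,
                           oracle_answer: str) -> float:
    # Standard case: binary reward via reference matching against the
    # oracle. Strict set equality is one plausible policy; the paper does
    # not specify how extra (e.g., distractor-induced) calls are scored.
    calls_ok = ({normalize_call(c) for c in rollout_calls}
                == {normalize_call(c) for c in oracle_calls})
    answer_ok = rollout_answer.strip() == oracle_answer.strip()
    return 1.0 if calls_ok and answer_ok else 0.0


def judge_reward(trajectory: str, expected_behavior: str,
                 judge: Callable[[str, str], bool]) -> float:
    # Special behaviors (e.g., flagging an erroneous tool output) have no
    # exact reference, so a lightweight LLM-as-judge callable verifies them.
    return 1.0 if judge(trajectory, expected_behavior) else 0.0


# Hypothetical usage with a trivial stand-in judge:
oracle = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
rollout = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
print(reference_match_reward(rollout, oracle, "18C, sunny", "18C, sunny"))  # 1.0
print(judge_reward("... the tool returned a 500 error, so I retried ...",
                   "detects and reports the erroneous tool output",
                   judge=lambda t, b: "error" in t))  # 1.0
```

Because the augmentations preserve the oracle calls and final answers, this check stays valid even as distractor tools and noisy outputs are layered in, which is what makes fully automatic online reward computation possible.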
