Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

18 February 2025
Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
Abstract

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain their policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow-matching model. This model has a specially designed structure with a hidden layer merged with an LLM, harnessing the linguistic cues of explanations to generate appropriate rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while reducing reliance on expensive human feedback; it thus enables effective explanations and even improves the accuracy of decisions in the original tasks.
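To make the conditioning idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a conditional flow-matching model whose hidden layer is merged with pooled LLM hidden states of an explanation, so that linguistic cues shape the generated reward signal. All module names, dimensions, and the stand-in inputs (FlowMatchingRewardModel, reward_dim, llm_hidden, and the toy training step) are assumptions for illustration only.

# Illustrative sketch of a flow-matching reward model conditioned on LLM
# explanation features; details are assumed, not taken from the paper.
import torch
import torch.nn as nn

class FlowMatchingRewardModel(nn.Module):
    """Predicts a velocity field v(x_t, t | explanation); the hidden layer is
    merged with LLM hidden states so explanation cues condition the reward."""

    def __init__(self, reward_dim=16, llm_hidden_dim=64, hidden_dim=128):
        super().__init__()
        self.in_proj = nn.Linear(reward_dim + 1, hidden_dim)   # noisy sample + time t
        self.llm_proj = nn.Linear(llm_hidden_dim, hidden_dim)  # explanation features from the LLM
        self.out_proj = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, reward_dim))

    def forward(self, x_t, t, llm_hidden):
        # Merge the flow model's hidden layer with the LLM's explanation representation.
        h = self.in_proj(torch.cat([x_t, t], dim=-1)) + self.llm_proj(llm_hidden)
        return self.out_proj(h)  # predicted velocity toward the clean reward embedding

def flow_matching_loss(model, x1, llm_hidden):
    """Standard conditional flow-matching objective on linear paths x_t = (1-t)*x0 + t*x1."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.size(0), 1)      # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # point on the straight-line path
    target_velocity = x1 - x0          # velocity of the linear path
    pred_velocity = model(x_t, t, llm_hidden)
    return ((pred_velocity - target_velocity) ** 2).mean()

if __name__ == "__main__":
    model = FlowMatchingRewardModel()
    # Stand-ins: pooled LLM hidden states of explanations and target reward embeddings.
    llm_hidden = torch.randn(8, 64)
    target_reward_embedding = torch.randn(8, 16)
    loss = flow_matching_loss(model, target_reward_embedding, llm_hidden)
    loss.backward()
    print(f"flow-matching loss: {loss.item():.4f}")

In this kind of setup, rewards for training the explanation-generating LLM could be read off from the trained flow model (e.g., via its reconstruction of a target reward embedding); how the paper actually derives the scalar reward is not specified in the abstract.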

@article{yang2025_2502.12530,
  title={Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards},
  author={Xinyi Yang and Liang Zeng and Heng Dong and Chao Yu and Xiaoran Wu and Huazhong Yang and Yu Wang and Milind Tambe and Tonghan Wang},
  journal={arXiv preprint arXiv:2502.12530},
  year={2025}
}