StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

21 May 2025
Ziliang Wang
Xuhui Zheng
Kang An
Cijun Ouyang
Jialu Cai
Yuhang Wang
Yichao Wu
Main: 10 pages, Bibliography: 4 pages, Appendix: 6 pages; 6 figures, 9 tables
Abstract

Efficient multi-hop reasoning requires Large Language Model (LLM)-based agents to iteratively acquire high-value external knowledge. Prior work has explored reinforcement learning (RL) to train LLMs for search-based document retrieval, achieving notable improvements in QA performance, but these approaches underperform on complex multi-hop QA because they rely only on sparse rewards from a global signal. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It provides richer, more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We construct a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various RL-based search baselines while using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released on this https URL.
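
To make the reward design concrete, below is a minimal Python sketch of a per-step retrieval reward that combines information gain with a redundancy penalty, in the spirit of the abstract. The function name, the set-overlap definitions, and the weight lam are illustrative assumptions, not the authors' implementation; the paper's actual supervision is token-level and folded into the step-wise PPO update rather than a standalone scalar per step.

# A minimal sketch of a step-wise search reward in the spirit of StepSearch.
# All names and the linear gain/penalty combination are illustrative
# assumptions, not the paper's exact formulation.

def step_reward(retrieved, seen, gold, lam=0.5):
    """Score one search step.

    retrieved: doc ids returned by this step's query.
    seen:      doc ids retrieved in all earlier steps.
    gold:      doc ids of the gold supporting evidence.
    """
    retrieved, seen, gold = set(retrieved), set(seen), set(gold)
    # Information gain: fraction of gold evidence newly covered at this step.
    gain = len((retrieved - seen) & gold) / max(len(gold), 1)
    # Redundancy penalty: fraction of this step's results already seen before.
    redundancy = len(retrieved & seen) / max(len(retrieved), 1)
    return gain - lam * redundancy

# Example: step 2 re-fetches d1 (redundant) and newly covers gold doc d3.
# gain = 1/2, redundancy = 1/2 -> reward = 0.5 - 0.5 * 0.5 = 0.25
print(step_reward(retrieved=["d1", "d3"], seen=["d1", "d2"], gold=["d2", "d3"]))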

@article{wang2025_2505.15107,
  title={StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization},
  author={Ziliang Wang and Xuhui Zheng and Kang An and Cijun Ouyang and Jialu Cai and Yuhang Wang and Yichao Wu},
  journal={arXiv preprint arXiv:2505.15107},
  year={2025}
}