LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Abstract

Despite the rapid development of long-context large language models (LLMs), data-centric approaches that rely on synthetic data have been hindered by faithfulness issues, limiting their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that fine-tuning on these datasets significantly improves model performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability to the development of long-context LLMs.
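
To make the synthesis idea concrete, the following is a minimal, hypothetical Python sketch of how examples in the spirit of LongFaith-SFT and LongFaith-PO could be assembled: a citation-based reasoning prompt over numbered passages, a ground-truth answer, and a faithful (attributed) versus unfaithful (unattributed) response forming a preference pair. All names, prompt wording, and helper functions here are illustrative assumptions, not the authors' released pipeline.

# Hypothetical sketch of citation-grounded data synthesis as described in the
# abstract; structure and naming are assumptions, not the released pipeline.
from dataclasses import dataclass

CITE_PROMPT = (
    "Answer the question using ONLY the numbered passages below. "
    "Cite every supporting passage inline as [k] before stating the final answer."
)

@dataclass
class SFTExample:
    instruction: str
    context: str
    response: str  # citation-grounded reasoning chain ending in the answer

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # faithful, attributed reasoning
    rejected: str  # unattributed or conflicting reasoning

def number_passages(passages: list[str]) -> str:
    # Number passages so responses can cite them as [1], [2], ...
    return "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))

def make_sft_example(question: str, passages: list[str],
                     gold_answer: str, gold_support: list[int]) -> SFTExample:
    """Build one SFT example whose target response cites its supporting passages."""
    context = number_passages(passages)
    citations = " ".join(f"[{i + 1}]" for i in gold_support)
    response = (
        f"The relevant evidence is {citations}. "
        f"Based on these passages, the answer is: {gold_answer}"
    )
    return SFTExample(f"{CITE_PROMPT}\n\nQuestion: {question}", context, response)

def make_preference_pair(sft: SFTExample, unfaithful_response: str) -> PreferencePair:
    """Pair the grounded response (chosen) with an unattributed one (rejected)."""
    prompt = f"{sft.instruction}\n\n{sft.context}"
    return PreferencePair(prompt, chosen=sft.response, rejected=unfaithful_response)

if __name__ == "__main__":
    passages = [
        "The Eiffel Tower was completed in 1889.",
        "The Statue of Liberty was dedicated in 1886.",
    ]
    ex = make_sft_example(
        "When was the Eiffel Tower completed?", passages, "1889", gold_support=[0]
    )
    pair = make_preference_pair(ex, "It was completed in 1886.")
    print(pair.chosen)
    print(pair.rejected)

In this reading, the citation-grounded completion serves as the SFT target (and the "chosen" side of the preference pair), while an unattributed or conflicting answer serves as "rejected"; the actual datasets would be synthesized at scale from multi-hop QA sources rather than hand-written examples like this one.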

@article{yang2025_2502.12583,
  title={LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data},
  author={Cehao Yang and Xueyuan Lin and Chengjin Xu and Xuhui Jiang and Shengjie Ma and Aofan Liu and Hui Xiong and Jian Guo},
  journal={arXiv preprint arXiv:2502.12583},
  year={2025}
}