WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale

23 February 2025
Jiaxi Li
Xingxing Zhang
Xun Wang
Xiaolong Huang
Li Dong
Liang Wang
Si-Qing Chen
Wei Lu
Furu Wei
Abstract

Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, fine-tuned on 150K instruction-response pairs synthesized using WildLong, surpass existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.
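
To make the graph-based step of the pipeline concrete, the minimal sketch below builds a co-occurrence graph over meta-information extracted from queries and samples new attribute combinations from it. Everything beyond "co-occurrence graph" is an assumption not stated in the abstract: the attribute schema (task, doc_type, operation), the example records, and the weighted random walk used for sampling are all illustrative, not the authors' actual method.

import random
from collections import defaultdict

# Hypothetical meta-information records extracted from real user queries.
# The schema below is illustrative; the abstract does not specify one.
records = [
    {"task": "summarization", "doc_type": "report", "operation": "aggregation"},
    {"task": "comparison", "doc_type": "paper", "operation": "cross-document"},
    {"task": "fact retrieval", "doc_type": "paper", "operation": "lookup"},
    {"task": "comparison", "doc_type": "report", "operation": "aggregation"},
]

# Co-occurrence graph: nodes are (attribute, value) pairs; edge weights
# count how often two pairs appear in the same query.
graph = defaultdict(lambda: defaultdict(int))
for rec in records:
    nodes = list(rec.items())
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            graph[a][b] += 1
            graph[b][a] += 1

def sample_combination(start, length=3):
    """Weighted random walk over the graph, proposing a plausible
    combination of meta-information values to seed a new instruction."""
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = graph[current]
        if not neighbors:
            break
        choices, weights = zip(*neighbors.items())
        current = random.choices(choices, weights=weights, k=1)[0]
        walk.append(current)
    return walk

seed = random.choice(list(graph))
print(sample_combination(seed))

A sampled walk such as [("task", "comparison"), ("operation", "aggregation"), ("doc_type", "report")] would then be handed to the generation stage as a realistic, jointly plausible task specification rather than an arbitrary mix of attributes.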

@article{li2025_2502.16684,
  title={WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale},
  author={Jiaxi Li and Xingxing Zhang and Xun Wang and Xiaolong Huang and Li Dong and Liang Wang and Si-Qing Chen and Wei Lu and Furu Wei},
  journal={arXiv preprint arXiv:2502.16684},
  year={2025}
}