Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

1 July 2024

Philippe Laban

Alexander R. Fabbri

Caiming Xiong

Chien-Sheng Wu

RALM

Papers citing "Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems"

36 / 36 papers shown

LLMs Get Lost In Multi-Turn Conversation Philippe Laban Hiroaki Hayashi Yingbo Zhou Jennifer Neville 42 1 0 09 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks Yixin Cao Shibo Hong X. Li Jiahao Ying Yubo Ma ... Juanzi Li Aixin Sun Xuanjing Huang Tat-Seng Chua Yu Jiang ALM ELM 84 1 0 26 Apr 2025
Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization Adithya Pratapa Teruko Mitamura RALM 34 0 0 17 Apr 2025
ML For Hardware Design Interpretability: Challenges and Opportunities Raymond Baartmans Andrew Ensinger Victor Agostinelli Lizhong Chen 29 0 0 11 Apr 2025
Reasoning Beyond Limits: Advances and Open Problems for LLMs M. Ferrag Norbert Tihanyi Merouane Debbah ELM OffRL LRM AI4CE 131 2 0 26 Mar 2025
Extract, Match, and Score: An Evaluation Paradigm for Long Question-context-answer Triplets in Financial Analysis Bo Hu Han Yuan Vlad Pandelea Wuqiong Luo Yingzhu Zhao Zheng Ma 53 0 0 20 Mar 2025
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings Austin Xu Srijan Bansal Yifei Ming Semih Yavuz Shafiq R. Joty ELM 95 3 0 19 Mar 2025
RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration Hong Qing Yu Frank McQuade 48 1 0 14 Mar 2025
Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation Junhao Zhang Richong Zhang Fanshuang Kong Ziyang Miao Yanhan Ye Yaowei Zheng SyDa 46 0 0 10 Mar 2025
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning Zhibin Lan Liqiang Niu Fandong Meng Jie Zhou Jinsong Su VLM 69 1 0 04 Mar 2025
U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack Yunfan Gao Yun Xiong Wenlong Wu Zijing Huang Bohan Li H. Wang 52 3 0 01 Mar 2025
Do Retrieval-Augmented Language Models Adapt to Varying User Needs? Peilin Wu Xinlu Zhang Wenhao Yu Xingyu Liu Xinya Du Zhiyu Zoey Chen RALM 45 0 0 27 Feb 2025
Evaluating the Effect of Retrieval Augmentation on Social Biases Tianhui Zhang Yi Zhou Danushka Bollegala 38 0 0 24 Feb 2025
Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches Adithya Pratapa Teruko Mitamura 94 1 0 10 Feb 2025
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap Gopi Krishnan Rajbahadur G. Oliva Dayi Lin Ahmed E. Hassan 46 1 0 28 Jan 2025
Understanding Synthetic Context Extension via Retrieval Heads Xinyu Zhao Fangcong Yin Greg Durrett 41 0 0 31 Dec 2024
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? Jonathan Roberts Kai Han Samuel Albanie LLMAG 136 0 0 07 Nov 2024
Long Context RAG Performance of Large Language Models Quinn Leng Jacob P. Portes Sam Havens Matei A. Zaharia Michael Carbin AIFin RALM 3DV 41 8 0 05 Nov 2024
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments Kung-Hsiang Huang Akshara Prabhakar Sidharth Dhawan Yixin Mao Huan Wang Silvio Savarese Caiming Xiong Philippe Laban C. Wu 44 7 0 04 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization David Wan Jesse Vig Mohit Bansal Shafiq R. Joty HILM 48 3 0 31 Oct 2024
Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage Kaige Xie Philippe Laban Prafulla Kumar Choubey Caiming Xiong C. Wu 31 1 0 20 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization Catarina G. Belém Pouya Pezeshkpour Hayate Iso Seiji Maekawa Nikita Bhutani Estevam R. Hruschka HILM 65 1 0 17 Oct 2024
Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses Pranav Narayanan Venkit Philippe Laban Yilun Zhou Yixin Mao C. Wu ELM 32 7 0 15 Oct 2024
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data Seiji Maekawa Hayate Iso Nikita Bhutani RALM 103 1 0 15 Oct 2024
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs Lei Wang Shan Dong Yuhui Xu Hanze Dong Yalu Wang Amrita Saha Ee-Peng Lim Caiming Xiong Doyen Sahoo LRM 40 1 0 07 Oct 2024
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly Howard Yen Tianyu Gao Minmin Hou Ke Ding Daniel Fleischer Peter Izsak Moshe Wasserblat Danqi Chen ALM ELM 62 25 0 03 Oct 2024
DeFine: Enhancing LLM Decision-Making with Factor Profiles and Analogical Reasoning Yebowen Hu Xiaoyang Wang Wenlin Yao Yiming Lu Daoan Zhang H. Foroosh Dong Yu Fei Liu 34 4 0 02 Oct 2024
Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks Zi Yang 30 0 0 10 Sep 2024
The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models Karime Maamari Fadhil Abubaker Daniel Jaroslawicz Amine Mhedhbi LRM 52 25 0 14 Aug 2024
Introducing a new hyper-parameter for RAG: Context Window Utilization Kush Juvekar A. Purwar 40 3 0 29 Jul 2024
Perceptions of Linguistic Uncertainty by Language Models and Humans Catarina G Belém Markelle Kelly M. Steyvers Sameer Singh Padhraic Smyth 43 3 0 22 Jul 2024
One Thousand and One Pairs: A "novel" challenge for long-context language models Marzena Karpinska Katherine Thai Kyle Lo Tanya Goyal Mohit Iyyer LRM 41 40 0 24 Jun 2024
MileBench: Benchmarking MLLMs in Long Context Dingjie Song Shunian Chen Guiming Hardy Chen Fei Yu Xiang Wan Benyou Wang VLM 78 34 0 29 Apr 2024
Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs Nan Hu Jiaoyan Chen Yike Wu Guilin Qi Sheng Bi Tongtong Wu Jeff Z. Pan HILM 37 8 0 26 Jan 2024
M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models Wai-Chung Kwan Xingshan Zeng Yufei Wang Yusen Sun Liangyou Li Lifeng Shang Qun Liu Kam-Fai Wong ELM 89 10 0 30 Oct 2023
Teaching Machines to Read and Comprehend Karl Moritz Hermann Tomás Kociský Edward Grefenstette L. Espeholt W. Kay Mustafa Suleyman Phil Blunsom 175 3,509 0 10 Jun 2015