
Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Main: 9 pages, 7 figures, 9 tables
Bibliography: 2 pages
Appendix: 6 pages
Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code is available at: this https URL.
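To make the abstract's mechanism concrete, below is a minimal sketch of how one might score context sentences with decoder attention from a small proxy model and a lightweight probe. It is illustrative only: the proxy checkpoint name, the feature aggregation (mean attention from query tokens to each sentence's tokens), and the logistic-regression probe are assumptions, not the paper's exact method.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

PROXY = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice of 0.5B proxy
tok = AutoTokenizer.from_pretrained(PROXY)
model = AutoModelForCausalLM.from_pretrained(PROXY)
model.eval()

def sentence_attention_features(query: str, sentences: list[str]) -> torch.Tensor:
    """Return one attention-derived feature vector per context sentence."""
    # Concatenate the context sentences followed by the query, tracking each
    # sentence's token span so attention mass can be pooled per sentence.
    spans, ids = [], []
    for s in sentences:
        s_ids = tok(s, add_special_tokens=False).input_ids
        spans.append((len(ids), len(ids) + len(s_ids)))
        ids += s_ids
    q_start = len(ids)
    ids += tok(query, add_special_tokens=False).input_ids

    with torch.no_grad():
        out = model(torch.tensor([ids]), output_attentions=True)
    # out.attentions: one (1, heads, seq, seq) tensor per layer
    att = torch.stack(out.attentions).squeeze(1)   # (layers, heads, seq, seq)
    q_to_ctx = att[:, :, q_start:, :]              # rows = query tokens

    feats = []
    for start, end in spans:
        # Mean attention each (layer, head) sends from the query to this sentence.
        feats.append(q_to_ctx[:, :, :, start:end].mean(dim=(2, 3)).flatten())
    return torch.stack(feats)                      # (num_sentences, layers * heads)

# A lightweight probe over per-sentence attention features (placeholder data):
# probe = LogisticRegression().fit(train_features, train_labels)
# feats = sentence_attention_features(query, sentences)
# keep = [s for s, f in zip(sentences, feats)
#         if probe.predict_proba(f.numpy().reshape(1, -1))[0, 1] > 0.5]

The key design point the sketch reflects is that the proxy model is used only in a forward pass to expose attention; the trainable component is the small classifier over attention features, not the LM itself.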

@article{zhang2025_2505.23277,
  title={Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective},
  author={Yong Zhang and Yanwen Huang and Ning Cheng and Yang Guo and Yun Zhu and Yanmeng Wang and Shaojun Wang and Jing Xiao},
  journal={arXiv preprint arXiv:2505.23277},
  year={2025}
}