Does quantization affect models' performance on long-context tasks?

26 May 2025
Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, Mohit Iyyer
Main: 8 pages · Appendix: 31 pages · Bibliography: 6 pages · 29 figures · 27 tables
Abstract

Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but it may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially on tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
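The abstract names BNB-nf4 among the evaluated 4-bit methods. As a rough illustration only (not the authors' evaluation harness), the sketch below shows how a model such as Llama-3.1 8B could be loaded with bitsandbytes NF4 quantization via Hugging Face Transformers; the checkpoint identifier and settings are assumptions for illustration.

# Minimal sketch, assuming transformers + bitsandbytes are installed and the
# (gated) Llama-3.1 checkpoint is accessible. Not the paper's evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical checkpoint choice

# BNB-nf4: 4-bit NormalFloat weight quantization, one of the methods in the abstract.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights across available GPUs
)

# A long-context evaluation would then feed a >64K-token prompt through model.generate().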

BibTeX
@article{mekala2025_2505.20276,
  title={Does quantization affect models' performance on long-context tasks?},
  author={Anmol Mekala and Anirudh Atmakuru and Yixiao Song and Marzena Karpinska and Mohit Iyyer},
  journal={arXiv preprint arXiv:2505.20276},
  year={2025}
}