Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Meng Xin
Sweta Priyadarshi
Jingyu Xin
Bilal Kartal
Aditya Vavre
Asma Kuriparambil Thekkumpate
Zijia Chen
Ameya Sunil Mahabaleshwarkar
Ido Shahaf
Akhiad Bercovich
Kinjal Patel
Suguna Varshini Velury
Chenjie Luo
Zhiyu Cheng
Jenny Chen
Chen-Han Yu
Wei Ping
Oleg Rybakov
Nima Tajbakhsh
Oluwatobi Olabiyi
Dusan Stosic
Di Wu
Song Han
Eric Chung
Sharath Turuvekere Sreenivas
Bryan Catanzaro
Yoshi Suhara
Tijmen Blankevoort
Huizi Mao
Main: 12 pages · Bibliography: 2 pages · Appendix: 3 pages · 3 figures · 10 tables
Abstract

This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: (1) it is remarkably effective and stable for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; (2) it is robust to data quality and coverage, enabling accuracy recovery without access to the full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
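To make the core idea concrete, below is a minimal sketch of a single QAD training step, assuming a PyTorch setup: a frozen full-precision (BF16) teacher and a student whose forward pass applies NVFP4 fake quantization. The function name `qad_step`, the `temperature` parameter, and the assumption that the student already has fake-quantization hooks installed are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal QAD sketch (assumed PyTorch setup, not the paper's exact code).
import torch
import torch.nn.functional as F

def qad_step(student, teacher, batch, optimizer, temperature=1.0):
    """One QAD update: KL divergence between teacher and student token distributions."""
    with torch.no_grad():
        # Frozen full-precision (BF16) teacher provides the target distribution.
        teacher_logits = teacher(**batch).logits

    # Student forward pass is assumed to run with NVFP4 fake quantization applied
    # to its weights/activations so gradients reflect quantized inference behavior.
    student_logits = student(**batch).logits

    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; scaled by T^2 as is
    # conventional for temperature-softened distillation.
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * temperature ** 2

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the loss depends only on the teacher's output distribution rather than ground-truth labels, a step like this can be run on generic text batches, which is consistent with the report's observation that QAD is robust to data quality and coverage.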
