Watermarking Degrades Alignment in Language Models: Analysis and Mitigation (WaLM)

Abstract
Watermarking techniques for large language models (LLMs) can significantly degrade output quality, yet their effects on truthfulness, safety, and helpfulness remain underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, in which enhanced helpfulness undermines model safety, and guard amplification, in which excessive caution reduces model helpfulness. Both patterns arise from watermark-induced shifts in the token distribution, surfacing the fundamental tension between alignment objectives.
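For context on the token-distribution shift the abstract refers to, below is a minimal sketch of KGW-style logit biasing (Kirchenbauer et al., 2023): a context-seeded "green list" of tokens receives a logit boost before sampling. The values of `gamma`, `delta`, and the vocabulary size are illustrative assumptions, not the paper's experimental settings.

```python
import torch

def kgw_bias_logits(logits: torch.Tensor, prev_token_id: int,
                    gamma: float = 0.25, delta: float = 2.0) -> torch.Tensor:
    """Sketch of KGW watermarking: boost a pseudo-random 'green list'.

    The green list covers a gamma fraction of the vocabulary and is
    seeded by the previous token, so a detector that knows the seeding
    scheme can later test for an excess of green tokens. The +delta
    boost is the distribution shift linked to alignment degradation.
    """
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))  # stand-in for a context hash
    perm = torch.randperm(vocab_size, generator=gen)
    green_ids = perm[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta  # shift probability mass toward green tokens
    return biased

# Usage: bias the next-token logits, then sample as usual.
logits = torch.randn(50_257)  # hypothetical vocabulary size
probs = torch.softmax(kgw_bias_logits(logits, prev_token_id=42), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```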
