Watermarking Degrades Alignment in Language Models: Analysis and Mitigation (WaLM)

Abstract
Watermarking techniques for large language models (LLMs) can significantly degrade output quality, yet their effects on truthfulness, safety, and helpfulness remain underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, in which enhanced helpfulness undermines model safety, and guard amplification, in which excessive caution reduces model helpfulness. Both patterns arise from watermark-induced shifts in the token distribution, surfacing the fundamental tension between alignment objectives.
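For context on the token-distribution shift the abstract refers to, below is a minimal sketch of KGW-style logit biasing (Kirchenbauer et al., 2023): a context-seeded "green list" of tokens receives a logit boost before sampling. The values of `gamma`, `delta`, and the vocabulary size are illustrative assumptions, not the paper's experimental settings.

```python
import torch

def kgw_bias_logits(logits: torch.Tensor, prev_token_id: int,
                    gamma: float = 0.25, delta: float = 2.0) -> torch.Tensor:
    """Sketch of KGW watermarking: boost a pseudo-random 'green list'.

    The green list covers a gamma fraction of the vocabulary and is
    seeded by the previous token, so a detector that knows the seeding
    scheme can later test for an excess of green tokens. The +delta
    boost is the distribution shift linked to alignment degradation.
    """
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))  # stand-in for a context hash
    perm = torch.randperm(vocab_size, generator=gen)
    green_ids = perm[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta  # shift probability mass toward green tokens
    return biased

# Usage: bias the next-token logits, then sample as usual.
logits = torch.randn(50_257)  # hypothetical vocabulary size
probs = torch.softmax(kgw_bias_logits(logits, prev_token_id=42), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```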
