When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

11 June 2025

Main: 9 pages; Bibliography: 2 pages; Appendix: 7 pages; 35 figures; 7 tables

Abstract

We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation (stability under rephrasing) and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality-of-service instability.
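As a rough illustration of what measuring response drift across rewordings could look like, the sketch below averages a pairwise dissimilarity over the responses a model returns for several paraphrases of one prompt. It is not the PBSS scoring itself, which the abstract does not specify: the Jaccard token distance is a stand-in chosen for simplicity, and the function names `jaccard_distance` and `mean_pairwise_drift` and the sample responses are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over lowercase whitespace tokens.

    Assumption: a crude lexical stand-in for whatever response
    distance a real evaluation would use.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_drift(responses: list[str]) -> float:
    """Average distance over all unordered pairs of responses.

    Each response is assumed to come from the same model, answering a
    different rewording of the same underlying prompt.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0  # fewer than two responses: no drift to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical responses to three rewordings of one prompt.
    responses = [
        "paris is the capital",
        "the capital is paris",
        "it is lyon",
    ]
    print(mean_pairwise_drift(responses))
```

A score of 0 means every rewording produced the same set of tokens; values closer to 1 mean the responses share little vocabulary. A purely lexical distance like this one cannot tell harmless rewording of the answer apart from a change in its meaning, so it is only a starting point.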


@article{li2025_2506.10095,
  title={When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs},
  author={Xiao Li and Joel Kreuzwieser and Alan Peters},
  journal={arXiv preprint arXiv:2506.10095},
  year={2025}
}
