Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models

Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent the mental states of self and others. Understanding these internal mechanisms is critical, not only for moving beyond surface-level performance but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs, probing models across different scales, training regimens, and prompts, and using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, and that these representations are structured rather than mere by-products of spurious correlations, yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
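As a rough illustration of the two operations described above, probing belief representations with a control task and editing activations, the sketch below works on synthetic placeholder activations. It is not the paper's implementation: the array shapes, the label scheme, the control-task construction, and the steering coefficient ALPHA are all assumptions made for the example.

```python
# Minimal, illustrative sketch (not the authors' implementation): a linear probe
# on cached LM activations for belief labels, a shuffled-label control task, and
# a probe-derived activation edit. Shapes, labels, and ALPHA are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data: in practice X would hold residual-stream activations cached
# at one layer for each ToM story, and y the belief label to decode (e.g.,
# whether the protagonist holds a true or a false belief).
n, d_model = 2000, 1024
X = rng.standard_normal((n, d_model)).astype(np.float32)
y = rng.integers(0, 2, size=n)

idx = rng.permutation(n)
tr, te = idx[:1600], idx[1600:]

# Linear probe trained on the real belief labels.
probe = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
acc = probe.score(X[te], y[te])

# Control task: identical inputs, random labels. A large gap (selectivity)
# between probe and control accuracy suggests the probe reads genuine structure
# in the activations rather than memorising arbitrary mappings.
y_ctrl = rng.integers(0, 2, size=n)
control = LogisticRegression(max_iter=1000).fit(X[tr], y_ctrl[tr])
ctrl_acc = control.score(X[te], y_ctrl[te])
print(f"probe acc={acc:.3f}  control acc={ctrl_acc:.3f}  "
      f"selectivity={acc - ctrl_acc:.3f}")

# Activation edit (sketch): nudge hidden states along the probe's belief
# direction at inference time, e.g. inside a forward hook on the probed layer.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
ALPHA = 4.0  # hypothetical edit strength

def edit_hidden_states(h: np.ndarray) -> np.ndarray:
    """Shift every token position of h (shape [seq_len, d_model]) along the belief direction."""
    return h + ALPHA * direction
```

In a real setup the placeholder arrays would be replaced by activations cached from the model on ToM stories, and the edit function would be applied to the hidden states of the chosen layer during generation.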
@article{bortoletto2025_2406.17513,
  title   = {Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models},
  author  = {Matteo Bortoletto and Constantin Ruhdorfer and Lei Shi and Andreas Bulling},
  journal = {arXiv preprint arXiv:2406.17513},
  year    = {2025}
}