Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Large Language Models (LLMs) have demonstrated the capability to generate free-text self-Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel, flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing them with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.
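To make the general idea concrete, below is a minimal, illustrative sketch of how a hidden-state interpretation could be compared against a self-NLE. It is not the paper's framework: the choice of GPT-2, the logit-lens style decoding of intermediate layers, the `all-MiniLM-L6-v2` sentence encoder, and cosine similarity as the agreement score are all assumptions made for illustration only.

```python
# Hedged sketch (not the paper's method): read intermediate hidden states with a
# crude logit-lens projection, then score how semantically close each layer's
# reading is to the model's self-explanation. All model and metric choices here
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: Is a penguin a bird? A: Yes, because"
self_nle = "penguins have feathers and lay eggs, which are defining traits of birds."

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit-lens style reading: project each layer's last-position hidden state
# through the final layer norm and the unembedding, keep the top-5 tokens.
layer_readings = []
for h in out.hidden_states[1:]:  # skip the embedding layer
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    top_ids = logits.topk(5, dim=-1).indices[0]
    layer_readings.append(tokenizer.decode(top_ids))

# Score agreement between each layer's reading and the self-NLE with a
# sentence-embedding cosine similarity (an assumed, not prescribed, metric).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
nle_emb = encoder.encode(self_nle, convert_to_tensor=True)
reading_embs = encoder.encode(layer_readings, convert_to_tensor=True)
scores = util.cos_sim(reading_embs, nle_emb).squeeze(-1)

for layer, (reading, score) in enumerate(zip(layer_readings, scores), start=1):
    print(f"layer {layer:2d}  sim={float(score):.3f}  tokens: {reading}")
```

In this toy setup, a high similarity at late layers would suggest the explanation echoes what the hidden states encode, while uniformly low scores would hint at a mismatch; the actual framework described in the paper defines its own interpretation and comparison procedure.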
@article{bhan2025_2506.09277,
  title={Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models},
  author={Milan Bhan and Jean-Noel Vittaut and Nicolas Chesneau and Sarath Chandar and Marie-Jeanne Lesot},
  journal={arXiv preprint arXiv:2506.09277},
  year={2025}
}