
v2 (latest)

Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

5 June 2025
Bhavik Chandna
Zubair Bashir
Procheta Sen
arXiv (abs) · PDF · HTML
Abstract

Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because the identified components are shared with these tasks.
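To make the kind of ablation experiment described in the abstract concrete, here is a minimal sketch using the TransformerLens library with GPT-2. The probe prompt, the logit-difference bias score, and the particular (layer, head) pair are hypothetical placeholders for illustration; the paper localizes components with its own edge-level metrics rather than this simple per-head score.

import torch
from transformer_lens import HookedTransformer

# Load a hooked GPT-2 so individual attention heads can be ablated.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "The doctor said that"  # hypothetical bias probe
tok_he = model.to_single_token(" he")
tok_she = model.to_single_token(" she")

def bias_score(logits):
    # Illustrative bias metric: logit difference between " he" and " she"
    # at the final position (not the paper's exact metric).
    last = logits[0, -1]
    return (last[tok_he] - last[tok_she]).item()

with torch.no_grad():
    baseline = bias_score(model(prompt))

LAYER, HEAD = 10, 7  # hypothetical component, chosen only for illustration

def zero_head(z, hook):
    # z has shape [batch, pos, n_heads, d_head]; zero out one head's output.
    z[:, :, HEAD, :] = 0.0
    return z

with torch.no_grad():
    ablated = model.run_with_hooks(
        prompt,
        fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)],
    )

print(f"bias score before ablation: {baseline:.3f}")
print(f"bias score after ablation:  {bias_score(ablated):.3f}")

Sweeping zero_head over all (layer, head) pairs and ranking by the change in the score is one simple way to observe the localization the abstract reports: most components barely move the metric, while a small subset accounts for most of the biased behavior.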

@article{chandna2025_2506.05166,
  title={Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective},
  author={Bhavik Chandna and Zubair Bashir and Procheta Sen},
  journal={arXiv preprint arXiv:2506.05166},
  year={2025}
}
Main: 9 pages · 10 figures · 6 tables · Bibliography: 4 pages · Appendix: 2 pages