Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs

7 March 2025
Zara Siddique
Irtaza Khalid
Liam D. Turner
Luis Espinosa-Anke
Abstract

We present a novel approach to bias mitigation in large language models (LLMs) that applies steering vectors to modify model activations during the forward pass. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen, respectively. Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender. By leveraging their collective strength, SVE outperforms individual steering vectors in both reducing bias and maintaining model performance. This work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that SVE is a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
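The core mechanics described in the abstract — deriving a steering vector from contrastive activation pairs, averaging per-axis vectors into an ensemble, and adding the result to hidden states during the forward pass — can be sketched as follows. This is a minimal illustration using NumPy and synthetic activations, not the authors' implementation; the function names, the mean-difference construction, and the scaling factor `alpha` are assumptions for exposition.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    # Mean difference between activations elicited by the two sides
    # of a contrastive pair dataset (e.g. biased vs. unbiased prompts).
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steering_vector_ensemble(vectors):
    # SVE: average individually optimized steering vectors,
    # one per bias axis (age, race, gender, ...).
    return np.mean(np.stack(vectors), axis=0)

def apply_steering(hidden_state, vector, alpha=1.0):
    # Add the (scaled) steering vector to a layer's hidden state
    # during the forward pass.
    return hidden_state + alpha * vector

# Synthetic demo: three bias axes, hidden dimension 8.
rng = np.random.default_rng(0)
d = 8
axes = ["age", "race", "gender"]
per_axis_vectors = [
    steering_vector(rng.normal(size=(4, d)) + 0.5,   # "positive" activations
                    rng.normal(size=(4, d)))          # "negative" activations
    for _ in axes
]
sve = steering_vector_ensemble(per_axis_vectors)
steered = apply_steering(rng.normal(size=d), sve, alpha=1.0)
```

In practice the vectors would be extracted from a specific transformer layer's residual stream, and the layer choice and scaling would be tuned (per the abstract, via Bayesian optimization over contrastive pair datasets).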

@article{siddique2025_2503.05371,
  title={Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs},
  author={Zara Siddique and Irtaza Khalid and Liam D. Turner and Luis Espinosa-Anke},
  journal={arXiv preprint arXiv:2503.05371},
  year={2025}
}