
Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Main: 8 pages · 14 figures · 10 tables · Bibliography: 3 pages · Appendix: 7 pages
Abstract

Large Language Models (LLMs) have shown impressive moral reasoning abilities, yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment and realigns models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral-acceptability scores (beyond binary labels) into a collective probability, weighting each model's contribution by its reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing the Jensen-Shannon (JS) divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show that our approach builds robust consensus and improves the fidelity of individual models. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
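The page shows only the abstract, so the sketch below is a minimal illustration of the two ingredients it names (reliability-weighted aggregation and JS divergence to the consensus), not the paper's implementation. It assumes each model outputs a discrete probability distribution over moral-acceptability bins; the reliability weights, bin count, and function names (aggregate, js_to_consensus) are hypothetical.

import numpy as np
from scipy.spatial.distance import jensenshannon

def aggregate(model_dists: np.ndarray, reliability: np.ndarray) -> np.ndarray:
    """Fuse per-model distributions over moral-acceptability bins into one
    consensus distribution, weighting each model by its reliability."""
    w = reliability / reliability.sum()            # normalize weights to sum to 1
    consensus = (w[:, None] * model_dists).sum(axis=0)
    return consensus / consensus.sum()             # guard against rounding drift

def js_to_consensus(model_dists: np.ndarray, consensus: np.ndarray) -> np.ndarray:
    """JS divergence of each model from the consensus; large values flag
    the misaligned models that the paper's procedure would realign."""
    # scipy's jensenshannon returns the JS *distance* (the square root of
    # the divergence), so square it to recover the divergence itself
    return np.array([jensenshannon(p, consensus) ** 2 for p in model_dists])

# Toy example: 3 models, acceptability scores discretized into 5 bins.
dists = np.array([[0.10, 0.20, 0.40, 0.20, 0.10],
                  [0.05, 0.15, 0.50, 0.20, 0.10],
                  [0.40, 0.30, 0.20, 0.05, 0.05]])  # an outlier model
rel = np.array([0.9, 0.8, 0.4])                     # hypothetical reliabilities
c = aggregate(dists, rel)
print(js_to_consensus(dists, c))                    # third model diverges most

In the abstract's framework, models whose divergence from the consensus exceeds some threshold would then have their token embeddings fine-tuned to shrink that divergence while preserving semantics; that optimization step is not reproduced here.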

@article{yuan2025_2506.14625,
  title={Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models},
  author={Chenchen Yuan and Zheyu Zhang and Shuo Yang and Bardh Prenkaj and Gjergji Kasneci},
  journal={arXiv preprint arXiv:2506.14625},
  year={2025}
}