MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded in the multi-label Social-Moral Image Database, which identifies co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy that scales our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
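To make the idea of "contrastive alignment with explicit moral grounding" concrete, the sketch below shows one plausible way a CLIP-style InfoNCE objective could be extended with a moral-alignment term driven by multi-hot MFT labels. This is a minimal illustration, not the authors' implementation: the function names (`moralclip_loss`, `moral_similarity`), the Jaccard-based label similarity, and the weighting parameter `lambda_moral` are assumptions introduced here for exposition.

```python
# Hypothetical sketch: CLIP-style contrastive loss plus a moral-alignment term.
# Names and the exact form of the moral term are illustrative assumptions,
# not the published MoralCLIP objective.
import torch
import torch.nn.functional as F

def moral_similarity(labels: torch.Tensor) -> torch.Tensor:
    """Pairwise Jaccard similarity between multi-hot MFT label vectors (B x F)."""
    inter = labels @ labels.T                                      # |A ∩ B|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter    # |A ∪ B|
    return inter / union.clamp(min=1e-8)

def moralclip_loss(img_emb, txt_emb, moral_labels,
                   temperature=0.07, lambda_moral=0.5):
    """Symmetric InfoNCE loss plus a term that pulls morally similar pairs together."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                             # B x B similarities

    # Standard CLIP objective: matched image-text pairs are the positives.
    targets = torch.arange(img.size(0), device=img.device)
    loss_clip = (F.cross_entropy(logits, targets) +
                 F.cross_entropy(logits.T, targets)) / 2

    # Moral supervision: cross-modal similarity should track MFT label overlap.
    moral_sim = moral_similarity(moral_labels.float())             # B x B in [0, 1]
    loss_moral = F.mse_loss(torch.sigmoid(logits), moral_sim)

    return loss_clip + lambda_moral * loss_moral
```

Under this formulation, the contrastive term preserves standard CLIP semantics while the auxiliary term encourages image-text pairs sharing moral foundations (e.g., care, fairness) to lie closer in the shared embedding space; how the two terms are actually balanced in MoralCLIP is specified in the paper.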
@article{condez2025_2506.05696,
  title   = {MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory},
  author  = {Ana Carolina Condez and Diogo Tavares and João Magalhães},
  journal = {arXiv preprint arXiv:2506.05696},
  year    = {2025}
}