MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded in the multi-label Social-Moral Image Database, which identifies co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy that scales our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
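To make the idea of "contrastive alignment with explicit moral grounding" concrete, the sketch below shows one plausible way a CLIP-style InfoNCE objective could be extended with a moral-alignment term driven by multi-hot MFT labels. This is a minimal illustration, not the authors' implementation: the function names (`moralclip_loss`, `moral_similarity`), the Jaccard-based label similarity, and the weighting parameter `lambda_moral` are assumptions introduced here for exposition.

```python
# Hypothetical sketch: CLIP-style contrastive loss plus a moral-alignment term.
# Names and the exact form of the moral term are illustrative assumptions,
# not the published MoralCLIP objective.
import torch
import torch.nn.functional as F

def moral_similarity(labels: torch.Tensor) -> torch.Tensor:
    """Pairwise Jaccard similarity between multi-hot MFT label vectors (B x F)."""
    inter = labels @ labels.T                                      # |A ∩ B|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter    # |A ∪ B|
    return inter / union.clamp(min=1e-8)

def moralclip_loss(img_emb, txt_emb, moral_labels,
                   temperature=0.07, lambda_moral=0.5):
    """Symmetric InfoNCE loss plus a term that pulls morally similar pairs together."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                             # B x B similarities

    # Standard CLIP objective: matched image-text pairs are the positives.
    targets = torch.arange(img.size(0), device=img.device)
    loss_clip = (F.cross_entropy(logits, targets) +
                 F.cross_entropy(logits.T, targets)) / 2

    # Moral supervision: cross-modal similarity should track MFT label overlap.
    moral_sim = moral_similarity(moral_labels.float())             # B x B in [0, 1]
    loss_moral = F.mse_loss(torch.sigmoid(logits), moral_sim)

    return loss_clip + lambda_moral * loss_moral
```

Under this formulation, the contrastive term preserves standard CLIP semantics while the auxiliary term encourages image-text pairs sharing moral foundations (e.g., care, fairness) to lie closer in the shared embedding space; how the two terms are actually balanced in MoralCLIP is specified in the paper.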
@article{condez2025_2506.05696,
  title   = {MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory},
  author  = {Ana Carolina Condez and Diogo Tavares and João Magalhães},
  journal = {arXiv preprint arXiv:2506.05696},
  year    = {2025}
}