EgoNormia: Benchmarking Physical Social Norm Understanding

Human activity is moderated by norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, particularly when norms are physically or socially grounded. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia, consisting of 1,853 challenging, multi-stage multiple-choice questions grounded in egocentric videos of human interactions, evaluating both the prediction and the justification of normative actions. The normative actions span seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline combining video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring at most 54% on EgoNormia (versus human performance of 92%). Our per-category analysis highlights significant risks in safety and privacy, as well as gaps in cooperation and communication, when such models are deployed in real-world agents. We additionally show that, through a retrieval-augmented generation (RAG) method, EgoNormia can be used to enhance normative reasoning in VLMs.
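As a rough illustration of the retrieval-augmented idea described above, the sketch below retrieves the most similar norm examples for a new scenario and prepends them to a VLM prompt. This is a minimal, hypothetical sketch, not the paper's implementation: the norm bank contents, the `retrieve_norm_examples` and `build_prompt` helpers, and the use of a sentence-transformers text encoder are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' implementation) of retrieval-augmented
# prompting for normative reasoning: retrieve similar norm examples for a new
# scenario and prepend them to the VLM prompt as in-context guidance.
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical bank of (scenario description, normative action) pairs,
# e.g. distilled from EgoNormia-style examples.
NORM_BANK = [
    ("A cyclist approaches a pedestrian crossing from behind.",
     "Slow down and verbally signal before passing (safety, communication)."),
    ("Two people reach for the last shared tool at a workbench.",
     "Offer the tool to the other person first (cooperation, politeness)."),
    ("A person films a video in a crowded waiting room.",
     "Avoid pointing the camera at bystanders' faces (privacy, proxemics)."),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
bank_emb = encoder.encode([s for s, _ in NORM_BANK], normalize_embeddings=True)

def retrieve_norm_examples(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k most similar (scenario, normative action) pairs."""
    q_emb = encoder.encode([query], normalize_embeddings=True)[0]
    scores = bank_emb @ q_emb  # cosine similarity (embeddings are normalized)
    top_idx = np.argsort(-scores)[:k]
    return [NORM_BANK[i] for i in top_idx]

def build_prompt(query: str, k: int = 2) -> str:
    """Assemble a prompt with retrieved norm examples preceding the new scenario."""
    examples = retrieve_norm_examples(query, k)
    context = "\n".join(f"Scenario: {s}\nNormative action: {a}" for s, a in examples)
    return (
        "Use the following norm examples as guidance.\n"
        f"{context}\n\n"
        f"Scenario: {query}\n"
        "What is the most socially appropriate next action, and why?"
    )

if __name__ == "__main__":
    print(build_prompt("Someone drops their groceries in a narrow hallway."))
```

The resulting prompt would then be passed, together with the video frames, to the VLM under evaluation; the retrieval step is text-only here purely for brevity.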
@article{rezaei2025_2502.20490,
  title   = {EgoNormia: Benchmarking Physical Social Norm Understanding},
  author  = {MohammadHossein Rezaei and Yicheng Fu and Phil Cuvin and Caleb Ziems and Yanzhe Zhang and Hao Zhu and Diyi Yang},
  journal = {arXiv preprint arXiv:2502.20490},
  year    = {2025}
}