OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

Lei Zhu
Xing Cai
Yingjie Chen
Yiheng Li
Binxin Yang
Hao Liu
Jie Chen
Chen Li
Jing Lyu
Main: 15 pages, 7 figures, 4 tables; Bibliography: 4 pages
Abstract

Recent advances in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify the root cause as structural deficiencies in existing datasets along three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides hierarchical annotations covering video-level scenes, frame-level interactions, and individual-level attributes. To support this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementing the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis of human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
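To make the three-level annotation structure concrete, the sketch below shows one plausible way such hierarchical records could be organized. This is purely an illustration: all class and field names (`VideoAnnotation`, `FrameInteraction`, `PersonAttributes`, etc.) are assumptions, not OmniHuman's actual schema, which is defined in the paper itself.

```python
# Hypothetical sketch of a three-level hierarchical annotation record,
# mirroring the abstract's description: video-level scenes, frame-level
# interactions, and individual-level attributes. Field names are
# illustrative assumptions, not the dataset's real format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonAttributes:
    """Individual-level attributes for one person in the video."""
    person_id: int
    appearance: str   # e.g., "red jacket, short hair"
    action: str       # e.g., "speaking", "walking"

@dataclass
class FrameInteraction:
    """Frame-level interaction between a person and a person/object."""
    frame_index: int
    subject_id: int
    target_id: int    # another person or an object instance
    relation: str     # e.g., "hands object to", "talks to"

@dataclass
class VideoAnnotation:
    """Video-level record tying the scene to its finer-grained layers."""
    video_id: str
    scene: str        # e.g., "outdoor market, handheld camera"
    interactions: List[FrameInteraction] = field(default_factory=list)
    persons: List[PersonAttributes] = field(default_factory=list)
```

Under this kind of layout, video-level fields capture global scene and camera context, while the nested lists carry the relational and per-person detail the abstract attributes to the dataset.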
