UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions

Main: 9 pages, 14 figures, 5 tables; Bibliography: 3 pages; Appendix: 7 pages
Abstract

The quality of a video dataset (image quality, resolution, and fine-grained captions) greatly influences the performance of video generation models. Growing demand for video applications, such as movie-level Ultra-High Definition (UHD) video generation and 4K short-video content creation, places higher requirements on high-quality video generation models. However, existing public datasets cannot support the related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which are 8K) text-to-video dataset named UltraVideo, which covers a wide range of topics (more than 100 kinds); each video has 9 structured captions with one summarized caption (average of 824 words). Specifically, we carefully design a highly automated four-stage curation process to obtain the final high-quality dataset: i) collection of diverse and high-quality video clips; ii) statistical data filtering; iii) model-based data purification; and iv) generation of comprehensive, structured captions. In addition, we extend Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. The UltraVideo dataset and UltraWan models are available at this https URL.

@article{xue2025_2506.13691,
  title={UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions},
  author={Zhucun Xue and Jiangning Zhang and Teng Hu and Haoyang He and Yinan Chen and Yuxuan Cai and Yabiao Wang and Chengjie Wang and Yong Liu and Xiangtai Li and Dacheng Tao},
  journal={arXiv preprint arXiv:2506.13691},
  year={2025}
}