We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at this https URL.
@article{huang2025_2503.11251,
  title={Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model},
  author={Haoyang Huang and Guoqing Ma and Nan Duan and Xing Chen and Changyi Wan and Ranchen Ming and Tianyu Wang and Bo Wang and Zhiying Lu and Aojie Li and Xianfang Zeng and Xinhao Zhang and Gang Yu and Yuhe Yin and Qiling Wu and Wen Sun and Kang An and Xin Han and Deshan Sun and Wei Ji and Bizhu Huang and Brian Li and Chenfei Wu and Guanzhe Huang and Huixin Xiong and Jiaxin He and Jianchang Wu and Jianlong Yuan and Jie Wu and Jiashuai Liu and Junjing Guo and Kaijun Tan and Liangyu Chen and Qiaohui Chen and Ran Sun and Shanshan Yuan and Shengming Yin and Sitong Liu and Wei Chen and Yaqi Dai and Yuchu Luo and Zheng Ge and Zhisheng Guan and Xiaoniu Song and Yu Zhou and Binxing Jiao and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Yi Xiu and Yibo Zhu and Heung-Yeung Shum and Daxin Jiang},
  journal={arXiv preprint arXiv:2503.11251},
  year={2025}
}