Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Heyi Chen
Siyan Chen
Xin Chen
Yanfei Chen
Ying Chen
Zhuo Chen
Feng Cheng
Tianheng Cheng
Xinqi Cheng
Xuyan Chi
Jian Cong
Jing Cui
Qinpeng Cui
Qide Dong
Junliang Fan
Jing Fang
Zetao Fang
Chengjian Feng
Han Feng
Mingyuan Gao
Yu Gao
Dong Guo
Qiushan Guo
Boyang Hao
Qingkai Hao
Bibo He
Qian He
Tuyen Hoang
Ruoqing Hu
Xi Hu
Weilin Huang
Zhaoyang Huang
Zhongyi Huang
Donglei Ji
Siqi Jiang
Wei Jiang
Yunpu Jiang
Zhuo Jiang
Ashley Kim
Jianan Kong
Zhichao Lai
Shanshan Lao
Yichong Leng
Ai Li
Feiya Li
Gen Li
Huixia Li
JiaShi Li
Liang Li
Ming Li
Shanshan Li
Tao Li
Xian Li
Xiaojie Li
Xiaoyang Li
Xingxing Li
Yameng Li
Yifu Li
Yiying Li
Chao Liang
Han Liang
Jianzhong Liang
Ying Liang
Zhiqiang Liang
Wang Liao
Yalin Liao
Heng Lin
Kengyu Lin
Shanchuan Lin
Xi Lin
Zhijie Lin
Feng Ling
Fangfang Liu
Gaohong Liu
Jiawei Liu
Jie Liu
Jihao Liu
Shouda Liu
Shu Liu
Sichao Liu
Songwei Liu
Xin Liu
Xue Liu
Yibo Liu
Zikun Liu
Zuxi Liu
Junlin Lyu
Lecheng Lyu
Qian Lyu
Han Mu
Xiaonan Nie
Jingzhe Ning
Xitong Pan
Yanghua Peng
Lianke Qin
Xueqiong Qu
Yuxi Ren
Kai Shen
Guang Shi
Lei Shi
Main: 8 pages
6 figures
Bibliography: 1 page
Appendix: 2 pages
Abstract

Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10×. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
