Show-o2: Improved Native Unified Multimodal Models

18 June 2025

Jinheng Xie

Zhenheng Yang

Mike Zheng Shou

VGen

Main: 13 pages, 3 figures, 12 tables. Bibliography: 6 pages.

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
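
To make the two objectives named in the abstract concrete, the following is a minimal NumPy sketch of how a next-token cross-entropy loss (language head) and a flow matching velocity-regression loss (flow head) might be combined. It is not taken from the paper: the straight noise-to-data interpolation path, the weighting factor `alpha`, and all function names are assumptions made for illustration only, and the actual Show-o2 formulation may differ.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V), targets: (T,) int ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_pair(x1, rng):
    """Sample (x_t, t, v_target) on an assumed straight path from noise to data."""
    x0 = rng.standard_normal(x1.shape)  # noise endpoint
    t = rng.uniform()                   # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # linear interpolation between endpoints
    v_target = x1 - x0                  # constant velocity along that path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def combined_loss(logits, targets, v_pred, v_target, alpha=1.0):
    """Hypothetical weighted sum of the two objectives (alpha is an assumption)."""
    return cross_entropy(logits, targets) + alpha * flow_matching_loss(v_pred, v_target)
```

In this sketch, `x1` stands in for a clean visual latent (for example, an encoded image), and `v_pred` would come from a flow head conditioned on `x_t` and `t`; both are placeholders rather than components described in the abstract.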

@article{xie2025_2506.15564,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Jinheng Xie and Zhenheng Yang and Mike Zheng Shou},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}
