arXiv:2507.17801
Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

23 July 2025
Yi Xin
Juncheng Yan
Qi Qin
Zhen Li
Dongyang Liu
Shicheng Li
Victor Shea-Jay Huang
Yupeng Zhou
Renrui Zhang
Le Zhuo
Tiancheng Han
Xiaoqing Sun
Siqi Luo
Mengmeng Wang
Bin Fu
Yuewen Cao
Hongsheng Li
Guangtao Zhai
Xiaohong Liu
Yu Qiao
Peng Gao
Main: 17 pages, 11 figures, 7 tables; Bibliography: 6 pages
Abstract

We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks, including subject-driven generation, image editing, controllable synthesis, and dense prediction, within a single generative framework. To further boost usability, we incorporate efficient decoding strategies such as inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, where the native Lumina-mGPT 2.0 performs exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at this https URL.
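The abstract cites speculative Jacobi sampling as one of the decoding accelerations. As a rough illustration of the underlying idea only (not the authors' released implementation), the sketch below shows greedy Jacobi decoding for a decoder-only token generator: a block of draft tokens is refined in parallel until it reaches a fixed point, instead of sampling one token per forward pass. The model interface, the helper name jacobi_decode_greedy, and the block length are assumptions made for this example.

```python
import torch

def jacobi_decode_greedy(model, prefix, block_len=8, max_new=256, eos_id=None):
    """Minimal greedy Jacobi decoding sketch (hypothetical helper).

    Assumes `model` is a decoder-only transformer that maps a token
    tensor of shape [1, seq_len] to logits of shape [1, seq_len, vocab].
    """
    tokens = prefix.clone()                          # [1, T] accepted tokens
    while tokens.shape[1] - prefix.shape[1] < max_new:
        # 1) Initialize a draft block (here: repeat the last accepted token).
        draft = tokens[:, -1:].repeat(1, block_len)
        while True:
            # 2) One parallel forward pass over accepted tokens + draft.
            logits = model(torch.cat([tokens, draft], dim=1))
            # Greedy prediction for each draft position (next-token shift).
            start = tokens.shape[1] - 1
            pred = logits[:, start:start + block_len].argmax(dim=-1)
            # 3) Fixed point: the draft reproduces itself, so it matches
            #    what sequential greedy decoding would have produced.
            if torch.equal(pred, draft):
                break
            draft = pred                             # refine and iterate
        tokens = torch.cat([tokens, draft], dim=1)
        if eos_id is not None and (draft == eos_id).any():
            break
    return tokens
```

Because the prediction at each draft position depends only on earlier positions, the first position is fixed after one iteration, the second after two, and so on, so each block converges in at most block_len parallel passes; in practice many positions stabilize early, which is where the speedup over strictly sequential decoding comes from.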
