LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team
Bin Xiao
Chao Wang
Chengjiang Li
Chi Zhang
Chong Peng
Hang Yu
Hao Yang
Haonan Yan
Haoze Sun
Haozhe Zhao
Hong Liu
Hui Su
Jiaqi Zhang
Jiawei Wang
Jing Li
Kefeng Zhang
Manyuan Zhang
Minhao Jing
Peng Pei
Quan Chen
Taofeng Xue
Tongxin Pan
Xiaotong Li
Xiaoyang Li
Xiaoyu Zhao
Xing Hu
Xinyang Lin
Xunliang Cai
Yan Bai
Yan Feng
Yanjie Li
Yao Qiu
Yerui Sun
Yifan Lu
Ying Luo
Yipeng Mei
Yitian Chen
Yuchen Xie
Yufang Liu
Yufei Chen
Yulei Qian
Yuqi Peng
Zhihang Yu
Zhixiong Han
Changran Wang
Chen Chen
Dian Zheng
Fengjiao Chen
Ge Yang
Haowei Guo
Haozhe Wang
Hongyu Li
Huicheng Jiang
Jiale Hong
Jialv Zou
Jiamu Li
Jianping Lin
Jiaxing Liu
Jie Yang
Jing Jin
Jun Kuang
Juncheng She
Kunming Luo
Kuofeng Gao
Lin Qiu
Linsen Guo
Mianqiu Huang
Qi Li
Qian Wang
Rumei Li
Siyu Ren
Wei Wang
Wenlong He
Xi Chen
Xiao Liu
Xiaoyu Li
Xu Huang
Xuanyu Zhu
Xuezhi Cao
Yaoming Zhu
Yifei Cao
Yimeng Jia
Yizhen Jiang
Yufei Gao
Zeyang Hu
Zhenlong Yuan
Zijian Zhang
Ziwen Wang
Main: 31 pages, 27 figures, 10 tables; Bibliography: 6 pages; Appendix: 6 pages
Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach that effectively reconciles the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: this https URL
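To make the "shared discrete space" idea concrete, the following is a minimal conceptual sketch, not the released LongCat-Next code: each modality is assumed to be mapped to token IDs by its own tokenizer (a visual tokenizer such as dNaViT would emit discrete codes for image patches), the codebooks are offset into one shared vocabulary, and a single causal Transformer is trained with one next-token cross-entropy objective over the interleaved sequence. All names, vocabulary sizes, and hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

# Assumed codebook sizes; the paper does not specify these values.
TEXT_VOCAB, VISUAL_CODES, AUDIO_CODES = 32_000, 8_192, 4_096
VOCAB = TEXT_VOCAB + VISUAL_CODES + AUDIO_CODES  # one shared token space

class TinyAutoregressiveLM(nn.Module):
    """Toy decoder-only model trained with a single NTP objective."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        # Boolean causal mask: True marks positions that may not be attended.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(ids), mask=causal)
        return self.head(h)

# Interleaved sequence: text tokens, then visual codes offset into the shared
# vocabulary, then audio codes; all are predicted by the same objective.
text  = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, VISUAL_CODES, (1, 64)) + TEXT_VOCAB
audio = torch.randint(0, AUDIO_CODES, (1, 32)) + TEXT_VOCAB + VISUAL_CODES
seq = torch.cat([text, image, audio], dim=1)

model = TinyAutoregressiveLM()
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)
)

The point of the sketch is only that, once every modality is lexicalized into the shared token space, no modality-specific loss or architecture branch is needed: the same causal attention and cross-entropy apply to text, visual, and audio positions alike.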
