LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team
Bin Xiao
Chao Wang
Chengjiang Li
Chi Zhang
Chong Peng
Hang Yu
Hao Yang
Haonan Yan
Haoze Sun
Haozhe Zhao
Hong Liu
Hui Su
Jiaqi Zhang
Jiawei Wang
Jing Li
Kefeng Zhang
Manyuan Zhang
Minhao Jing
Peng Pei
Quan Chen
Taofeng Xue
Tongxin Pan
Xiaotong Li
Xiaoyang Li
Xiaoyu Zhao
Xing Hu
Xinyang Lin
Xunliang Cai
Yan Bai
Yan Feng
Yanjie Li
Yao Qiu
Yerui Sun
Yifan Lu
Ying Luo
Yipeng Mei
Yitian Chen
Yuchen Xie
Yufang Liu
Yufei Chen
Yulei Qian
Yuqi Peng
Zhihang Yu
Zhixiong Han
Changran Wang
Chen Chen
Dian Zheng
Fengjiao Chen
Ge Yang
Haowei Guo
Haozhe Wang
Hongyu Li
Huicheng Jiang
Jiale Hong
Jialv Zou
Jiamu Li
Jianping Lin
Jiaxing Liu
Jie Yang
Jing Jin
Jun Kuang
Juncheng She
Kunming Luo
Kuofeng Gao
Lin Qiu
Linsen Guo
Mianqiu Huang
Qi Li
Qian Wang
Rumei Li
Siyu Ren
Wei Wang
Wenlong He
Xi Chen
Xiao Liu
Xiaoyu Li
Xu Huang
Xuanyu Zhu
Xuezhi Cao
Yaoming Zhu
Yifei Cao
Yimeng Jia
Yizhen Jiang
Yufei Gao
Zeyang Hu
Zhenlong Yuan
Zijian Zhang
Ziwen Wang
Main: 31 pages, 27 figures, 10 tables; Bibliography: 6 pages; Appendix: 6 pages
Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach that effectively reconciles the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: this https URL
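To make the "shared discrete space" idea concrete, the following is a minimal conceptual sketch, not the released LongCat-Next code: each modality is assumed to be mapped to token IDs by its own tokenizer (a visual tokenizer such as dNaViT would emit discrete codes for image patches), the codebooks are offset into one shared vocabulary, and a single causal Transformer is trained with one next-token cross-entropy objective over the interleaved sequence. All names, vocabulary sizes, and hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

# Assumed codebook sizes; the paper does not specify these values.
TEXT_VOCAB, VISUAL_CODES, AUDIO_CODES = 32_000, 8_192, 4_096
VOCAB = TEXT_VOCAB + VISUAL_CODES + AUDIO_CODES  # one shared token space

class TinyAutoregressiveLM(nn.Module):
    """Toy decoder-only model trained with a single NTP objective."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        # Boolean causal mask: True marks positions that may not be attended.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(ids), mask=causal)
        return self.head(h)

# Interleaved sequence: text tokens, then visual codes offset into the shared
# vocabulary, then audio codes; all are predicted by the same objective.
text  = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, VISUAL_CODES, (1, 64)) + TEXT_VOCAB
audio = torch.randint(0, AUDIO_CODES, (1, 32)) + TEXT_VOCAB + VISUAL_CODES
seq = torch.cat([text, image, audio], dim=1)

model = TinyAutoregressiveLM()
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)
)

The point of the sketch is only that, once every modality is lexicalized into the shared token space, no modality-specific loss or architecture branch is needed: the same causal attention and cross-entropy apply to text, visual, and audio positions alike.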
