2409.18869
Cited By
Emu3: Next-Token Prediction is All You Need
27 September 2024
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
Jinsheng Wang
Fan Zhang
Yueze Wang
Zhen Li
Qiying Yu
Yingli Zhao
Yulong Ao
Xuebin Min
Tao Li
Boya Wu
Bo Zhao
Bowen Zhang
Liangdong Wang
Guang Liu
Zheqi He
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
Papers citing
"Emu3: Next-Token Prediction is All You Need"
47 / 47 papers shown
Title
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Yicheng Xiao
Lin Song
Yukang Chen
Yingmin Luo
Y. Chen
Yukang Gan
Wei Huang
Xiu Li
Xiaojuan Qi
Ying Shan
LRM
12
0
0
19 May 2025
Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng
A. Goel
Hakan Bilen
LRM
31
0
0
12 May 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin
Teng Wang
Yixiao Ge
Yuying Ge
Zhichao Lu
Ying Wei
Qingfu Zhang
Zhenan Sun
Ying Shan
MLLM
VLM
70
0
0
08 May 2025
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu
Gongye Liu
Jiajun Liang
Yong Li
Jiaheng Liu
Xinyu Wang
Pengfei Wan
Di Zhang
Wanli Ouyang
AI4CE
68
0
0
08 May 2025
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Biao Gong
Cheng Zou
Dandan Zheng
Hu Yu
Jingdong Chen
...
Qingpei Guo
Rui Liu
Weilong Chai
Xinyu Xiao
Ziyuan Huang
MLLM
79
1
0
05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Xuzhi Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
Improving Physical Object State Representation in Text-to-Image Generative Systems
Tianle Chen
Chaitanya Chakka
Deepti Ghadiyaram
34
0
0
04 May 2025
Generating Animated Layouts as Structured Text Representations
Yeonsang Shin
Jihwan Kim
Yumin Song
Kyungseung Lee
Hyunhee Chung
Taeyoung Na
DiffM
VGen
70
0
0
02 May 2025
WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation
D. Zhang
Che Jiang
Ruoshi Xu
Biaoxiang Chen
Zijian Jin
Yutian Lu
Jianguo Zhang
Liang Yong
Jiebo Luo
Shengda Luo
VLM
47
0
0
02 May 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
D. Jiang
Ziyu Guo
Renrui Zhang
Zhuofan Zong
Hao Li
Le Zhuo
Shilin Yan
Pheng-Ann Heng
Yiming Li
LRM
72
2
0
01 May 2025
A Survey of Interactive Generative Video
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xinyu Wang
Pengfei Wan
Di Zhang
Kun Gai
Hao Chen
Xihui Liu
VGen
65
0
0
30 Apr 2025
Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing
Hong Zhang
Zhongjie Duan
Xingjun Wang
Yuze Zhao
Weiyi Lu
Zhipeng Di
Yongjun Xu
Yingda Chen
Yu Zhang
MLLM
94
1
0
30 Apr 2025
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo
Thao Nguyen
Xun Huang
Siddharth Srinivasan Iyer
Yijun Li
...
Eli Shechtman
Krishna Kumar Singh
Yong Jae Lee
Bolei Zhou
Yuheng Li
77
0
0
29 Apr 2025
YoChameleon: Personalized Vision and Language Generation
Thao Nguyen
Krishna Kumar Singh
Jing Shi
Trung H. Bui
Yong Jae Lee
Yuheng Li
MLLM
82
0
0
29 Apr 2025
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Xu Ma
Peize Sun
Haoyu Ma
Hao Tang
Chih-Yao Ma
...
Matt Feiszli
Peizhao Zhang
Peter Vajda
Sam S. Tsai
Y. Fu
73
1
0
24 Apr 2025
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Kaihang Pan
Wang Lin
Zhongqi Yue
Tenglong Ao
Liyu Jia
Wei Zhao
Juncheng Billy Li
Siliang Tang
Hanwang Zhang
52
2
0
20 Apr 2025
Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision
Yuandong Pu
Le Zhuo
Kaiwen Zhu
Liangbin Xie
Wenlong Zhang
Xiangyu Chen
Peng Gao
Yu Qiao
Chao Dong
Yihao Liu
MLLM
69
1
0
07 Apr 2025
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Zhiyuan Yan
Junyan Ye
Weijia Li
Zilong Huang
Shenghai Yuan
Xiangyang He
Kaiqing Lin
Jun-Jian He
Conghui He
Li Yuan
MLLM
EGVM
93
10
0
03 Apr 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao
Peiyuan Zhang
Kexian Tang
Hao Li
Zicheng Zhang
Guangtao Zhai
Junchi Yan
Hua Yang
Xue Yang
Haodong Duan
VLM
LRM
46
0
0
03 Apr 2025
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Hang Guo
Yawei Li
Taolin Zhang
Jiadong Wang
Tao Dai
Shu-Tao Xia
Luca Benini
72
2
0
30 Mar 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu
Feiyu Xiong
Lumin Xu
Sheng Jin
Zhonghua Wu
Qingyi Tao
Wentao Liu
Wei Li
Chen Change Loy
VGen
177
2
0
27 Mar 2025
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Wei Song
Yansen Wang
Zijia Song
Yadong Li
Haoze Sun
Xin Wu
Zenan Zhou
Jianhua Xu
Jiaqi Wang
Kaicheng Yu
60
2
0
18 Mar 2025
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
Tianle Li
Yongming Rao
Winston Hu
Yu Cheng
MLLM
68
0
0
16 Mar 2025
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
57
0
0
14 Mar 2025
PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Runze He
Bo Cheng
Yuhang Ma
Qingxiang Jia
Shanyuan Liu
Ao Ma
Xiaoyu Wu
Liebucha Wu
Dawei Leng
Yuhui Yin
DiffM
VLM
54
0
0
13 Mar 2025
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
Xuying Zhang
Yutong Liu
Yangguang Li
Renrui Zhang
Yong Liu
...
Wanli Ouyang
Zhiwei Xiong
Peng Gao
Qibin Hou
Ming-Ming Cheng
127
3
0
13 Mar 2025
ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models
Zicheng Ma
Chuanliu Fan
Zhicong Wang
Zhenyu Chen
Xiaohan Lin
Yong Li
Shihao Feng
Jun Zhang
Ziqiang Cao
Y. Gao
48
0
0
11 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
47
0
0
10 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Zhenpeng Chen
Chunwei Wang
Xiuwei Chen
Hang Xu
J. Han
Xiandan Liang
VLM
71
1
0
09 Mar 2025
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
MLLM
VLM
LRM
109
6
0
21 Jan 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan L. Yuille
Lu Jiang
VGen
58
2
0
10 Jan 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
Xianrui Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming Yang
VLM
96
12
0
07 Jan 2025
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
104
2
0
20 Dec 2024
Parallelized Autoregressive Visual Generation
Yanjie Wang
Shuhuai Ren
Zhijie Lin
Yujin Han
Haoyuan Guo
Zhenheng Yang
Difan Zou
Jiashi Feng
Xihui Liu
VGen
90
12
0
19 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
123
9
0
19 Dec 2024
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
Hongyu Chen
Zihan Wang
Xianrui Li
Xingchen Sun
Fangyi Chen
Jiang Liu
Jiadong Wang
Bhiksha Raj
Zicheng Liu
Emad Barsoum
VLM
114
7
0
14 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
221
0
0
12 Dec 2024
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma
Huachen Gao
Haoge Deng
Zhengxiong Luo
Tiejun Huang
Lulu Tang
Xinlong Wang
DiffM
VGen
122
14
0
09 Dec 2024
Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
Junjie Wen
Minjie Zhu
Yichen Zhu
Zhibin Tang
Jinming Li
...
Chengmeng Li
Xiaoyu Liu
Yaxin Peng
Chaomin Shen
Feifei Feng
88
15
0
04 Dec 2024
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Anton Voronov
Denis Kuznedelev
Mikhail Khoroshikh
Valentin Khrulkov
Dmitry Baranchuk
111
2
0
02 Dec 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
DiffM
VGen
VLM
135
6
0
28 Nov 2024
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Yongwei Chen
Yushi Lan
Shangchen Zhou
Tengfei Wang
Xingang Pan
102
5
0
25 Nov 2024
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
Teng Zhou
Xiaoyu Zhang
Yongchuan Tang
MLLM
DiffM
95
0
0
24 Nov 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang
Zhe Chen
Wenhai Wang
Yue Cao
Yangzhou Liu
...
Jinguo Zhu
X. Zhu
Lewei Lu
Yu Qiao
Jifeng Dai
LRM
62
48
1
15 Nov 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
65
25
0
10 Oct 2024
ElasticTok: Adaptive Tokenization for Image and Video
Wilson Yan
Matei A. Zaharia
Volodymyr Mnih
Pieter Abbeel
Aleksandra Faust
Hao Liu
VGen
49
6
0
10 Oct 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen
Yunhao Gou
Runhui Huang
Zhili Liu
Daxin Tan
...
Qun Liu
Jun Yao
Lu Hou
Hang Xu
AuLLM
MLLM
VLM
82
21
0
26 Sep 2024