ResearchTrend.AI
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

9 October 2023
Lijun Yu
José Lezama
N. B. Gundavarapu
Luca Versari
Kihyuk Sohn
David C. Minnen
Yong Cheng
Vighnesh Birodkar
Agrim Gupta
Xiuye Gu
Alexander G. Hauptmann
Boqing Gong
Ming-Hsuan Yang
Irfan Essa
David A. Ross
Lu Jiang
arXiv: 2310.05737

Papers citing "Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation"

50 / 228 papers shown
MaskBit: Embedding-free Image Generation via Bit Tokens
Mark Weber
Lijun Yu
Qihang Yu
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
DiffM
51
30
0
24 Sep 2024
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Hong Chen
Xin Wang
Yuwei Zhou
Bin Huang
Yipeng Zhang
Wei Feng
Houlun Chen
Zeyang Zhang
Siao Tang
Wenwu Zhu
DiffM
55
7
0
23 Sep 2024
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Zhuoyan Luo
Fengyuan Shi
Yixiao Ge
Yujiu Yang
Limin Wang
Ying Shan
VLM
50
52
0
06 Sep 2024
OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
Liuhan Chen
Zongjian Li
Bin Lin
Bin Zhu
Qian Wang
Shenghai Yuan
X. Zhou
Xinhua Cheng
Li Yuan
DiffM
91
14
0
02 Sep 2024
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang
Haoyue Zhan
Liwei Liu
Ruihong Zeng
Haotian Guo
Jiachen Zheng
Qiang Zhang
Shunsi Zhang
Zhizheng Wu
36
42
0
01 Sep 2024
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
Zanlin Ni
Yulin Wang
Renping Zhou
Rui Lu
Jiayi Guo
Jinyi Hu
Zhiyuan Liu
Yuan Yao
Gao Huang
37
7
0
31 Aug 2024
SurGen: Text-Guided Diffusion Model for Surgical Video Generation
Joseph Cho
Samuel Schmidgall
C. Zakka
Mrudang Mathur
Dhamanpreet Kaur
R. Shad
W. Hiesinger
VGen
MedIm
31
6
0
26 Aug 2024
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Can Qin
Congying Xia
Krithika Ramakrishnan
Michael S Ryoo
Lifu Tu
...
Silvio Savarese
Juan Carlos Niebles
Zeyuan Chen
Ran Xu
Caiming Xiong
VGen
DiffM
76
2
0
22 Aug 2024
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie
Weijia Mao
Zechen Bai
David Junhao Zhang
Weihao Wang
Kevin Qinghong Lin
Yuchao Gu
Zhijie Chen
Zhenheng Yang
Mike Zheng Shou
57
165
0
22 Aug 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang
Jiayan Teng
Wendi Zheng
Ming Ding
Shiyu Huang
...
Weihan Wang
Yean Cheng
Xiaotao Gu
Yuxiao Dong
Jie Tang
DiffM
VGen
104
405
0
12 Aug 2024
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Zhiyu Tan
Xiaomeng Yang
Luozheng Qin
Hao Li
VGen
58
16
0
05 Aug 2024
Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
Xinhan Di
Jiahao Lu
Yunming Liang
Junjie Zheng
Yihua Wang
Chaofan Ding
ALM
35
1
0
01 Aug 2024
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Zhenghao Zhang
Junchao Liao
Menghao Li
Zuozhuo Dai
Bingxue Qiu
Hao Hu
Shaowei Cai
Weizhi Wang
VGen
50
45
0
31 Jul 2024
Generating 3D House Wireframes with Semantics
Xueqi Ma
Yilin Liu
Wenjun Zhou
Ruowei Wang
Hui Huang
3DV
36
0
0
17 Jul 2024
EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
Zhiyuan Chen
Jiajiong Cao
Zhiquan Chen
Yuming Li
Chenguang Ma
VGen
40
49
0
11 Jul 2024
Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Wentao Zhang
Junliang Guo
Tianyu He
Li Zhao
Linli Xu
Jiang Bian
47
3
0
10 Jul 2024
Balance of Number of Embedding and their Dimensions in Vector Quantization
Hang Chen
Sankepally Sainath Reddy
Ziwei Chen
Dianbo Liu
49
1
0
06 Jul 2024
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
William Berman
A. Peysakhovich
39
4
0
26 Jun 2024
Autoregressive Image Generation without Vector Quantization
Tianhong Li
Yonglong Tian
He Li
Mingyang Deng
Kaiming He
DiffM
62
183
0
17 Jun 2024
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
Tianren Ma
Lingxi Xie
Yunjie Tian
Boyu Yang
Yuan Zhang
44
0
0
17 Jun 2024
ControlVAR: Exploring Controllable Visual Autoregressive Modeling
Xiang Li
Kai Qiu
Hao Chen
Jason Kuen
Zhe-nan Lin
Rita Singh
Bhiksha Raj
DiffM
43
21
0
14 Jun 2024
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
Junke Wang
Yi-Xin Jiang
Zehuan Yuan
Binyue Peng
Zuxuan Wu
Yu-Gang Jiang
ViT
VGen
80
38
0
13 Jun 2024
Hierarchical Patch Diffusion Models for High-Resolution Video Generation
Ivan Skorokhodov
Willi Menapace
Aliaksandr Siarohin
Sergey Tulyakov
VGen
48
10
0
12 Jun 2024
An Image is Worth 32 Tokens for Reconstruction and Generation
Qihang Yu
Mark Weber
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
VLM
ViT
60
85
0
11 Jun 2024
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao
Yuanjun Xiong
Philipp Krahenbuhl
45
18
0
11 Jun 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun
Yi Jiang
Shoufa Chen
Shilong Zhang
Bingyue Peng
Ping Luo
Zehuan Yuan
VLM
66
229
0
10 Jun 2024
Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis
Zanlin Ni
Yulin Wang
Renping Zhou
Jiayi Guo
Jinyi Hu
Zhiyuan Liu
Shiji Song
Yuan Yao
Gao Huang
37
14
0
08 Jun 2024
VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal
Zongyu Lin
Tianyi Xie
Zeshun Zong
Michal Yarom
Yonatan Bitton
Chenfanfu Jiang
Ningyu Zhang
Kai-Wei Chang
Aditya Grover
EGVM
VGen
40
37
0
05 Jun 2024
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou
Jiawei Chen
J. Chen
Yuanzhe Chen
Zhuo Chen
...
Wenjie Zhang
Wenjie Qu
Zilin Zhao
Dejian Zhong
Xiaobin Zhuang
49
80
0
04 Jun 2024
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang
Dingdong Wang
Haohan Guo
Xueyuan Chen
Xixin Wu
Helen M. Meng
67
26
0
04 Jun 2024
Promptus: Can Prompts Streaming Replace Video Streaming with Stable Diffusion
Jiangkai Wu
Liming Liu
Yunpeng Tan
Junlin Hao
Xinggong Zhang
43
2
0
30 May 2024
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
Jiaqi Xu
Xinyi Zou
Kunzhe Huang
Yunkuo Chen
Bo Liu
Mengli Cheng
Xing Shi
Jun Huang
VGen
40
37
0
29 May 2024
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Vicky Zayats
Peter Chen
Melissa Ferrari
Dirk Padfield
AI4CE
38
0
0
29 May 2024
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Harshit Varma
Dheeraj M. Nagaraj
Karthikeyan Shanmugam
VLM
70
2
0
27 May 2024
ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
Sucheng Ren
Hongru Zhu
Chen Wei
Yijiang Li
Alan Yuille
Cihang Xie
AI4TS
VGen
SSL
59
1
0
24 May 2024
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Basile Van Hoorick
Rundi Wu
Ege Ozguroglu
Kyle Sargent
Ruoshi Liu
P. Tokmakov
Achal Dave
Changxi Zheng
Carl Vondrick
DiffM
VGen
58
29
0
23 May 2024
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Shiqi Yang
Zhi-Wei Zhong
Mengjie Zhao
Shusuke Takahashi
Masato Ishii
Takashi Shibuya
Yuki Mitsufuji
43
3
0
23 May 2024
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
Gwanghyun Kim
Alonso Martinez
Yu-Chuan Su
Brendan Jou
José Lezama
...
Lijun Yu
Lu Jiang
A. Jansen
Jacob Walker
Krishna Somandepalli
32
8
0
22 May 2024
Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM
VLM
43
7
0
16 May 2024
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Siyuan Li
Zedong Wang
Zicheng Liu
Di Wu
Cheng Tan
Jiangbin Zheng
Yufei Huang
Stan Z. Li
40
7
0
13 May 2024
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Zheng Zhu
Xiaofeng Wang
Wangbo Zhao
Chen Min
Nianchen Deng
...
Dawei Zhao
Liang Xiao
Jian-jun Zhao
Jiwen Lu
Guan Huang
VGen
LM&Ro
87
38
0
06 May 2024
Auto-Encoding Morph-Tokens for Multimodal LLM
Kaihang Pan
Siliang Tang
Juncheng Li
Zhaoyu Fan
Wei Chow
Shuicheng Yan
Tat-Seng Chua
Yueting Zhuang
Hanwang Zhang
MLLM
35
18
0
03 May 2024
FlexiFilm: Long Video Generation with Flexible Conditions
Yichen Ouyang
Jianhao Yuan
Hao Zhao
Gaoang Wang
Bo Zhao
DiffM
42
7
0
29 Apr 2024
On the Content Bias in Fréchet Video Distance
Jason S. Hoffman
Aniruddha Mahapatra
Gaurav Parmar
Jun-Yan Zhu
Jia-Bin Huang
EGVM
50
15
0
18 Apr 2024
Predicting Long-horizon Futures by Conditioning on Geometry and Time
Tarasha Khurana
Deva Ramanan
AI4TS
55
0
0
17 Apr 2024
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Keyu Tian
Yi-Xin Jiang
Zehuan Yuan
Bingyue Peng
Liwei Wang
VGen
42
260
0
03 Apr 2024
CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model
S. Han
Joohee Kim
DiffM
CLIP
34
1
0
22 Mar 2024
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Zhengqing Yuan
Ruoxi Chen
Zhaoxu Li
Haolong Jia
Lifang He
Chi Wang
Lichao Sun
VGen
65
27
0
20 Mar 2024
VisualCritic: Making LMMs Perceive Visual Quality Like Humans
Zhipeng Huang
Zhizheng Zhang
Yiting Lu
Zheng-Jun Zha
Zhibo Chen
Baining Guo
MLLM
55
12
0
19 Mar 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Lei Zhu
Fangyun Wei
Yanye Lu
MLLM
VLM
52
17
0
12 Mar 2024