Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

5 September 2023
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
Tianlu Wang
Arun Babu
Binh Tang
Brian Karrer
Shelly Sheynin
Candace Ross
Adam Polyak
Russell Howes
Vasu Sharma
Puxin Xu
Hovhannes Tamoyan
Oron Ashual
Uriel Singer
Shang-Wen Li
Susan Zhang
Rich James
Gargi Ghosh
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
    MLLM

Papers citing "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning"

35 / 35 papers shown
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Xu Ma
Peize Sun
Haoyu Ma
Hao Tang
Chih-Yao Ma
...
Matt Feiszli
Peizhao Zhang
Peter Vajda
Sam S. Tsai
Y. Fu
136
2
0
24 Apr 2025
A Template Is All You Meme
Luke Bates
Peter Ebert Christensen
Preslav Nakov
Iryna Gurevych
VLM
99
1
0
20 Feb 2025
GeoDiffuser: Geometry-Based Image Editing with Diffusion Models
Rahul Sajnani
Jeroen Vanbaar
Jie Min
Kapil D. Katyal
Srinath Sridhar
DiffM
98
11
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
287
55
0
03 Jan 2025
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
427
0
0
12 Dec 2024
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Qu He
Jinlong Peng
P. Xu
Boyuan Jiang
Xiaobin Hu
...
Yang Liu
Yun Wang
Chengjie Wang
Xuelong Li
Jing Zhang
DiffM
158
1
0
04 Dec 2024
Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
Weimin Qiu
Jieke Wang
Meng Tang
DiffM
123
0
0
28 Nov 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri
Bartłomiej Cupiał
Samuel Coward
Ulyana Piterbarg
Maciej Wolczyk
...
Lerrel Pinto
Rob Fergus
Jakob Foerster
Jack Parker-Holder
Tim Rocktäschel
LLMAG
LRM
189
19
0
20 Nov 2024
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Gihyun Kwon
Jong Chul Ye
DiffM
84
5
0
08 Oct 2024
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Deqing Fu
Tong Xiao
Rui Wang
Wang Zhu
Pengchuan Zhang
Guan Pang
Robin Jia
Lawrence Chen
104
7
0
07 Oct 2024
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang
Zilong Xie
Yicheng Feng
Yijiang Li
Xingrun Xing
Sipeng Zheng
Zongqing Lu
MLLM
54
1
0
03 Oct 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Ping Luo
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
128
54
0
05 Aug 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
114
34
0
07 Jun 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLM
DiffM
99
22
0
24 May 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
159
307
0
16 May 2024
Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images
Zhuoran Yu
Chenchen Zhu
Sean Culatana
Raghuraman Krishnamoorthi
Fanyi Xiao
Yong Jae Lee
146
15
0
04 Dec 2023
Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang
Anyi Rao
Maneesh Agrawala
AI4CE
122
4,106
1
10 Feb 2023
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
Srinivasan Iyer
Xi Lin
Ramakanth Pasunuru
Todor Mihaylov
Daniel Simig
...
Jeff Wang
Christopher Dewan
Asli Celikyilmaz
Luke Zettlemoyer
Veselin Stoyanov
ALM
123
266
0
22 Dec 2022
Faster Maximum Inner Product Search in High Dimensions
Mo Tiwari
Ryan Kang
Je-Yong Lee
Luke Lee
Chris Piech
Sebastian Thrun
Ilan Shomorony
Martin Jinye Zhang
53
6
0
14 Dec 2022
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks
Aleksander Holynski
Alexei A. Efros
DiffM
193
1,796
0
17 Nov 2022
Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li
Ari Holtzman
Daniel Fried
Percy Liang
Jason Eisner
Tatsunori Hashimoto
Luke Zettlemoyer
M. Lewis
91
358
0
27 Oct 2022
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Wenhu Chen
Hexiang Hu
Chitwan Saharia
William W. Cohen
VLM
154
173
0
29 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
261
1,240
0
20 Sep 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
178
1,114
0
22 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
364
3,535
0
29 Apr 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya A. Ramesh
Prafulla Dhariwal
Alex Nichol
Casey Chu
Mark Chen
VLM
DiffM
370
6,854
0
13 Apr 2022
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
Oran Gafni
Adam Polyak
Oron Ashual
Shelly Sheynin
Devi Parikh
Yaniv Taigman
DiffM
59
520
0
24 Mar 2022
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol
Prafulla Dhariwal
Aditya A. Ramesh
Pranav Shyam
Pamela Mishkin
Bob McGrew
Ilya Sutskever
Mark Chen
323
3,594
0
20 Dec 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
391
4,937
0
24 Feb 2021
Connecting Vision and Language with Localized Narratives
Jordi Pont-Tuset
J. Uijlings
Soravit Changpinyo
Radu Soricut
V. Ferrari
ObjD
81
249
0
06 Dec 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
414
20,114
0
23 Oct 2019
VizWiz Grand Challenge: Answering Visual Questions from Blind People
Danna Gurari
Qing Li
Abigale Stangl
Anhong Guo
Chi Lin
Kristen Grauman
Jiebo Luo
Jeffrey P. Bigham
CoGe
88
847
0
22 Feb 2018
Neural Discrete Representation Learning
Aaron van den Oord
Oriol Vinyals
Koray Kavukcuoglu
BDL
SSL
OCL
224
5,004
0
02 Nov 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
328
3,235
0
02 Dec 2016
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
209
2,475
0
01 Apr 2015