2312.00785
Cited By
Sequential Modeling Enables Scalable Learning for Large Vision Models
1 December 2023
Yutong Bai
Xinyang Geng
K. Mangalam
Amir Bar
Alan Yuille
Trevor Darrell
Jitendra Malik
Alexei A. Efros
MLLM
VLM
Papers citing
"Sequential Modeling Enables Scalable Learning for Large Vision Models"
50 / 129 papers shown
Title
Conquering the Retina: Bringing Visual in-Context Learning to OCT
Alessio Negrini
Simon Reiß
28
0
0
18 Jun 2025
TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy
Héctor Carrión
Yutong Bai
Víctor A. Hernández Castro
Kishan Panaganti
Ayush Zenith
Matthew Trang
Tony Zhang
Pietro Perona
Jitendra Malik
VGen
54
0
0
12 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
78
0
0
11 Jun 2025
A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning
Zhiyu Zhang
Wei Chen
Youfang Lin
Huaiyu Wan
OffRL
CLL
128
0
0
04 Jun 2025
CONCORD: Concept-Informed Diffusion for Dataset Distillation
Jianyang Gu
Haonan Wang
Ruoxi Jia
Saeed Vahidian
Vyacheslav Kungurtsev
Wei Jiang
Yiran Chen
DiffM
DD
960
0
0
23 May 2025
Video-GPT via Next Clip Diffusion
Shaobin Zhuang
Zhipeng Huang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Binxin Yang
Chong Sun
Chen Li
Yali Wang
DiffM
VGen
273
1
0
18 May 2025
Patient-Specific Autoregressive Models for Organ Motion Prediction in Radiotherapy
Yuxiang Lai
Jike Zhong
Vanessa Su
Xiaofeng Yang
104
0
0
17 May 2025
Visual Planning: Let's Think Only with Images
Yi Xu
Chengzu Li
Han Zhou
Xingchen Wan
Caiqi Zhang
Anna Korhonen
Ivan Vulić
LM&Ro
LRM
176
3
0
16 May 2025
Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Tam Minh Nguyen
Ngoc N. Tran
Khai Nguyen
Richard G. Baraniuk
MoE
127
0
0
01 May 2025
RayZer: A Self-supervised Large View Synthesis Model
Hanwen Jiang
Hao Tan
Peng Wang
Haian Jin
Yue Zhao
...
Kai Zhang
Fujun Luan
Kalyan Sunkavalli
Qixing Huang
Georgios Pavlakos
188
3
0
01 May 2025
A Survey of Interactive Generative Video
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xinyu Wang
Pengfei Wan
Di Zhang
Kun Gai
Hao Chen
Xihui Liu
VGen
127
4
0
30 Apr 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
187
1
0
28 Apr 2025
E-InMeMo: Enhanced Prompting for Visual In-Context Learning
Jiahao Zhang
Bowen Wang
Hong Liu
Liangzhi Li
Yuta Nakashima
Hajime Nagahara
VLM
184
0
0
25 Apr 2025
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Junliang Guo
Yang Ye
Tianyu He
Haoyu Wu
Yushu Jiang
Tim Pearce
Li Zhao
VGen
SyDa
132
14
0
11 Apr 2025
Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation
Jiaming Chen
Wentao Zhao
Ziyu Meng
Donghui Mao
Ran Song
Wei Pan
Wei Zhang
145
1
0
07 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
178
6
0
03 Apr 2025
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
Presented at
ResearchTrend Connect | VLM
on
04 Jun 2025
195
9
0
01 Apr 2025
Test-Time Visual In-Context Tuning
Jiahao Xie
A. Tonioni
N. Rauschmayr
F. Tombari
Bernt Schiele
OOD
VLM
91
1
0
27 Mar 2025
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Alex Jinpeng Wang
Linjie Li
Zhiyong Yang
Lijuan Wang
Min Li
DiffM
119
1
0
26 Mar 2025
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Yuchao Gu
Weijia Mao
Mike Zheng Shou
VGen
185
17
0
25 Mar 2025
Position: Interactive Generative Video as Next-Generation Game Engine
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xintao Wang
Pengfei Wan
Di Zhang
Xihui Liu
VGen
126
4
0
21 Mar 2025
CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
Masud Ahmed
Zahid Hasan
Syed Arefinul Haque
A. Faridee
S. Purushotham
Suya You
Nirmalya Roy
206
0
0
19 Mar 2025
Fast Autoregressive Video Generation with Diagonal Decoding
Yang Ye
Junliang Guo
Haoyu Wu
Tianyu He
Tim Pearce
Tabish Rashid
Katja Hofmann
Li Zhao
DiffM
VGen
122
3
0
18 Mar 2025
The Power of Context: How Multimodality Improves Image Super-Resolution
Kangfu Mei
Hossein Talebi
Mojtaba Ardakani
Vishal M. Patel
P. Milanfar
M. Delbracio
DiffM
134
4
0
18 Mar 2025
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Lan Chen
Qi Mao
Yuchao Gu
Mike Zheng Shou
177
4
0
17 Mar 2025
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
193
1
0
14 Mar 2025
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin
Mengqi Huang
Shuhan Zhuang
Zhendong Mao
VGen
114
3
0
13 Mar 2025
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Shijia Zhao
Qiming Xia
Xusheng Guo
Pufan Zou
Maoji Zheng
Hai Wu
Chenglu Wen
Cheng-Yu Wang
3DPC
143
0
0
09 Mar 2025
Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity
Xiaohao Xu
Feng Xue
Xianrui Li
Haowei Li
Steve Yang
Tianze Zhang
Matthew Johnson-Roberson
Xiaonan Huang
3DV
76
0
0
08 Mar 2025
Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
Jiesi Hu
Chenfei Ye
Yanwu Yang
Xutao Guo
Yang Shang
P. Shi
Hanyang Peng
Ting Ma
119
0
0
04 Mar 2025
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao
Gengwei Zhang
Yinlong Qian
Jiancheng Huang
Yao Zhao
Humphrey Shi
Lin Ma
Y. X. Wei
Zequn Jie
VLM
114
8
0
27 Feb 2025
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi
Wenyao Zhang
Yufei Ding
Runpei Dong
Xinqiang Yu
...
Xin Jin
Kaisheng Ma
Zhizheng Zhang
He Wang
Li Yi
LM&Ro
220
12
0
18 Feb 2025
Audio Texture Manipulation by Exemplar-Based Analogy
Kan Jen Cheng
Tingle Li
Gopala Anumanchipalli
DiffM
85
1
0
21 Jan 2025
How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?
Wenxuan Li
Alan Yuille
Zongwei Zhou
MedIm
150
11
0
20 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
209
45
0
31 Dec 2024
VarAD: Lightweight High-Resolution Image Anomaly Detection via Visual Autoregressive Modeling
Yunkang Cao
Haiming Yao
Wei Luo
Nong Sang
148
7
0
23 Dec 2024
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
...
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
150
5
0
19 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
255
13
0
19 Dec 2024
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data
Hanwen Jiang
Zexiang Xu
Desai Xie
Zheyu Chen
Haian Jin
...
Xin Sun
Jiuxiang Gu
Qixing Huang
Georgios Pavlakos
Hao Tan
506
5
0
18 Dec 2024
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
147
2
0
16 Dec 2024
Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Bolin Lai
F. Xu
Miao Liu
Xiaoliang Dai
Nikhil Mehta
...
Zeyi Huang
James M. Rehg
Sangmin Lee
Ning Zhang
Tong Xiao
143
3
0
02 Dec 2024
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
Teng Zhou
Xiaoyu Zhang
Yongchuan Tang
MLLM
DiffM
207
1
0
24 Nov 2024
There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks
Miguel Espinosa
Chenhongyi Yang
Linus Ericsson
Jingyu Sun
Elliot J. Crowley
VLM
124
1
0
22 Nov 2024
UniFlow: A Foundation Model for Unified Urban Spatio-Temporal Flow Prediction
Yuan Yuan
Jingtao Ding
Chonghua Han
Depeng Jin
Yong Li
AI4TS
AI4CE
199
2
0
20 Nov 2024
LaVin-DiT: Large Vision Diffusion Transformer
Zhaoqing Wang
Xiaobo Xia
Runnan Chen
Dongdong Yu
Changhu Wang
Mingming Gong
Tongliang Liu
204
9
0
18 Nov 2024
GFT: Graph Foundation Model with Transferable Tree Vocabulary
Zehong Wang
Zheyuan Zhang
Nitesh Chawla
Chuxu Zhang
Yanfang Ye
108
20
0
09 Nov 2024
Moving Off-the-Grid: Scene-Grounded Video Representations
Sjoerd van Steenkiste
Daniel Zoran
Yi Yang
Yulia Rubanova
Rishabh Kabra
...
Thomas Keck
João Carreira
Alexey Dosovitskiy
Mehdi S. M. Sajjadi
Thomas Kipf
81
4
0
08 Nov 2024
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
Hao Fei
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
203
14
0
08 Nov 2024
Analyzing The Language of Visual Tokens
David M. Chan
Rodolfo Corona
J. S. Park
Cheol Jun Cho
Yutong Bai
Trevor Darrell
45
4
0
07 Nov 2024
Expanding Sparse Tuning for Low Memory Usage
Shufan Shen
Junshu Sun
Xiangyang Ji
Qingming Huang
Shuhui Wang
121
0
0
04 Nov 2024