2410.11087
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
14 October 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
Papers citing
"Locality Alignment Improves Vision-Language Models"
50 / 123 papers shown
Title
RL makes MLLMs see better than SFT
Junha Song
Sangdoo Yun
Dongyoon Han
Jaegul Choo
Byeongho Heo
OffRL
107
0
0
18 Oct 2025
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Hubert Baniecki
Maximilian Muschalik
Fabian Fumagalli
Barbara Hammer
Eyke Hüllermeier
P. Biecek
FAtt
142
0
0
07 Aug 2025
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
Ziteng Wang
Siqi Yang
Limeng Qiao
Lin Ma
VLM
197
0
0
04 Aug 2025
Visual symbolic mechanisms: Emergent symbol processing in vision language models
Rim Assouel
Declan Campbell
Taylor Webb
110
2
0
18 Jun 2025
Bias and Generalizability of Foundation Models across Datasets in Breast Mammography
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Elodie Germani
Selin Türk Ilayda
Zeineddine Fatima
Mourad Charbel
Shadi Albarqouni
AI4CE
284
3
0
14 May 2025
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Yunbo Wang
VLM
426
4
0
04 Mar 2025
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
S M Sarwar
310
2
0
25 Feb 2025
Demystifying CLIP Data
International Conference on Learning Representations (ICLR), 2023
Hu Xu
Saining Xie
Xiaoqing Ellen Tan
Po-Yao (Bernie) Huang
Russell Howes
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
Luke Zettlemoyer
Christoph Feichtenhofer
VLM
CLIP
419
189
0
31 Dec 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
421
1,450
0
31 Jul 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
293
586
0
24 Jun 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
424
567
0
16 May 2024
What matters when building vision-language models?
Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
228
265
0
03 May 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiaohua Zhai
VLM
280
19
0
28 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
European Conference on Computer Vision (ECCV), 2024
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
309
163
0
18 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
355
240
0
14 Mar 2024
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu
Wen Liu
Bo Zhang
Bing-Li Wang
Kai Dong
...
Yaofeng Sun
Chengqi Deng
Hanwei Xu
Zhenda Xie
Chong Ruan
VLM
317
602
0
08 Mar 2024
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Siddharth Karamcheti
Suraj Nair
Ashwin Balakrishna
Percy Liang
Thomas Kollar
Dorsa Sadigh
MLLM
VLM
200
214
0
12 Feb 2024
Scalable Pre-training of Large Autoregressive Image Models
International Conference on Machine Learning (ICML), 2024
Alaaeldin El-Nouby
Michal Klein
Shuangfei Zhai
Miguel Angel Bautista
Alexander Toshev
Vaishaal Shankar
J. Susskind
Armand Joulin
VLM
214
106
0
16 Jan 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Computer Vision and Pattern Recognition (CVPR), 2024
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
332
525
0
11 Jan 2024
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain
Jianwei Yang
Humphrey Shi
MLLM
142
45
0
21 Dec 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM
VLM
220
132
0
13 Nov 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM
CoGe
262
193
0
30 Oct 2023
Mistral 7B
Albert Q. Jiang
Alexandre Sablayrolles
A. Mensch
Chris Bamford
Devendra Singh Chaplot
...
Teven Le Scao
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoE
LRM
318
2,838
0
10 Oct 2023
Improved Baselines with Visual Instruction Tuning
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
500
3,935
0
05 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
International Conference on Learning Representations (ICLR), 2023
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIP
VLM
205
98
0
02 Oct 2023
Data Filtering Networks
International Conference on Learning Representations (ICLR), 2023
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
311
203
0
29 Sep 2023
Vision Transformers Need Registers
International Conference on Learning Representations (ICLR), 2023
Zilong Chen
Maxime Oquab
Julien Mairal
Huaping Liu
ViT
308
570
0
28 Sep 2023
Contrastive Feature Masking Open-Vocabulary Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
VLM
254
35
0
02 Sep 2023
MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
R. Birkl
Diana Wofk
Matthias Muller
MDE
243
198
0
26 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
4.2K
14,778
0
18 Jul 2023
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Neural Information Processing Systems (NeurIPS), 2023
Mostafa Dehghani
Basil Mustafa
Josip Djolonga
Jonathan Heek
Matthias Minderer
...
Avital Oliver
Piotr Padlewski
A. Gritsenko
Mario Lučić
N. Houlsby
ViT
305
177
0
12 Jul 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
International Conference on Learning Representations (ICLR), 2023
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLM
ObjD
VLM
281
986
0
26 Jun 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
International Conference on Learning Representations (ICLR), 2023
Fuxiao Liu
Kevin Qinghong Lin
Linjie Li
Jianfeng Wang
Yaser Yacoob
Lijuan Wang
VLM
MLLM
327
383
0
26 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Neural Information Processing Systems (NeurIPS), 2023
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
682
80
0
13 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Neural Information Processing Systems (NeurIPS), 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
2.1K
6,197
0
09 Jun 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Neural Information Processing Systems (NeurIPS), 2023
Ibrahim Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
434
86
0
22 May 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Neural Information Processing Systems (NeurIPS), 2023
Wen Wang
Zhe Chen
Xiaokang Chen
Jiannan Wu
Xizhou Zhu
...
Ping Luo
Tong Lu
Jie Zhou
Yu Qiao
Jifeng Dai
MLLM
VLM
242
604
0
18 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM
LRM
635
1,176
0
17 May 2023
An Inverse Scaling Law for CLIP Training
Neural Information Processing Systems (NeurIPS), 2023
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
209
74
0
11 May 2023
DataComp: In search of the next generation of multimodal datasets
Neural Information Processing Systems (NeurIPS), 2023
S. Gadre
Gabriel Ilharco
Alex Fang
J. Hayase
Georgios Smyrnis
...
A. Dimakis
J. Jitsev
Y. Carmon
Vaishaal Shankar
Ludwig Schmidt
VLM
393
559
0
27 Apr 2023
Visual Instruction Tuning
Neural Information Processing Systems (NeurIPS), 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
838
6,977
0
17 Apr 2023
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy Q. Vo
Marc Szafraniec
...
Edouard Grave
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
VLM
CLIP
SSL
938
5,533
0
14 Apr 2023
Segment Anything
IEEE International Conference on Computer Vision (ICCV), 2023
A. Kirillov
Eric Mintun
Nikhila Ravi
Hanzi Mao
Chloe Rolland
...
Spencer Whitehead
Alexander C. Berg
Wan-Yen Lo
Piotr Dollár
Ross B. Girshick
MLLM
VLM
782
10,432
0
05 Apr 2023
Sigmoid Loss for Language Image Pre-Training
IEEE International Conference on Computer Vision (ICCV), 2023
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP
VLM
1.1K
2,028
0
27 Mar 2023
EVA-02: A Visual Representation for Neon Genesis
Image and Vision Computing (IVC), 2023
Yuxin Fang
Quan-Sen Sun
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
ViT
CLIP
324
383
0
20 Mar 2023
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
International Conference on Machine Learning (ICML), 2023
Shuangfei Zhai
Tatiana Likhomanenko
Etai Littwin
Dan Busbridge
Jason Ramapuram
Yizhe Zhang
Jiatao Gu
J. Susskind
AAML
279
108
0
11 Mar 2023
Scaling Vision Transformers to 22 Billion Parameters
International Conference on Machine Learning (ICML), 2023
Mostafa Dehghani
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Jonathan Heek
...
Mario Lučić
Xiaohua Zhai
Daniel Keysers
Jeremiah Harmsen
N. Houlsby
MLLM
359
731
0
10 Feb 2023
Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti
Romain Beaumont
Ross Wightman
Mitchell Wortsman
Gabriel Ilharco
Cade Gordon
Christoph Schuhmann
Ludwig Schmidt
J. Jitsev
VLM
CLIP
333
1,106
0
14 Dec 2022
Scaling Language-Image Pre-training via Masking
Computer Vision and Pattern Recognition (CVPR), 2022
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
286
378
0
01 Dec 2022
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Computer Vision and Pattern Recognition (CVPR), 2022
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
484
872
0
14 Nov 2022