Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2503.15485
Cited By
v1
v2 (latest)
TULIP: Towards Unified Language-Image Pretraining
19 March 2025
Zineng Tang
Long Lian
Seun Eisape
Xudong Wang
Roei Herzig
Adam Yala
Alane Suhr
Trevor Darrell
David M. Chan
VLM
CLIP
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"TULIP: Towards Unified Language-Image Pretraining"
50 / 52 papers shown
Title
JAFAR: Jack up Any Feature at Any Resolution
Paul Couairon
Loick Chambon
Louis Serrano
Jean-Emmanuel Haugeard
Matthieu Cord
Nicolas Thome
MDE
40
0
0
10 Jun 2025
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Heekyung Lee
Jiaxin Ge
Tsung-Han Wu
Minwoo Kang
Trevor Darrell
David M. Chan
ReLM
CoGe
LRM
40
0
0
29 May 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu
Fangfu Liu
Yi-Hsin Hung
Yueqi Duan
LRM
79
1
0
29 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
155
1
0
08 May 2025
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
Presented at
ResearchTrend Connect | VLM
on
04 Jun 2025
172
6
0
01 Apr 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen
A. Gritsenko
Xiao Wang
Muhammad Ferjad Naeem
Ibrahim Alabdulmohsin
...
Basil Mustafa
Olivier J. Hénaff
Jeremiah Harmsen
Andreas Steiner
Xiaohua Zhai
VLM
144
80
0
21 Feb 2025
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
157
12
0
29 Aug 2024
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li
Haoqin Tu
Mude Hui
Zeyu Wang
Bingchen Zhao
...
Jieru Mei
Qing Liu
Huangjie Zheng
Yuyin Zhou
Cihang Xie
VLM
MLLM
92
42
0
12 Jun 2024
The Platonic Representation Hypothesis
Minyoung Huh
Brian Cheung
Tongzhou Wang
Phillip Isola
138
142
0
13 May 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu
Yushi Hu
Bangzheng Li
Yu Feng
Haoyu Wang
Xudong Lin
Dan Roth
Noah A. Smith
Wei-Chiu Ma
Ranjay Krishna
VLM
LRM
MLLM
148
150
0
18 Apr 2024
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE
VLM
155
87
0
15 Feb 2024
Rethinking Patch Dependence for Masked Autoencoders
Letian Fu
Long Lian
Renhao Wang
Baifeng Shi
Xudong Wang
Adam Yala
Trevor Darrell
Alexei A. Efros
Ken Goldberg
142
16
0
25 Jan 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
151
349
0
11 Jan 2024
CLAIR: Evaluating Image Captions with Large Language Models
David M. Chan
Suzanne Petryk
Joseph E. Gonzalez
Trevor Darrell
John F. Canny
94
21
0
19 Oct 2023
Improved Baselines with Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
237
2,830
0
05 Oct 2023
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Yonglong Tian
Lijie Fan
Phillip Isola
Huiwen Chang
Dilip Krishnan
VLM
DiffM
145
153
0
01 Jun 2023
Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation
Lisa Dunlap
Alyssa Umino
Han Zhang
Jiezhi Yang
Joseph E. Gonzalez
Trevor Darrell
DiffM
91
79
0
25 May 2023
Unsupervised Semantic Correspondence Using Stable Diffusion
Eric Hedlin
Gopal Sharma
Shweta Mahajan
Hossam N. Isack
Abhishek Kar
Andrea Tagliasacchi
K. M. Yi
DiffM
106
95
0
24 May 2023
DataComp: In search of the next generation of multimodal datasets
S. Gadre
Gabriel Ilharco
Alex Fang
J. Hayase
Georgios Smyrnis
...
A. Dimakis
J. Jitsev
Y. Carmon
Vaishaal Shankar
Ludwig Schmidt
VLM
118
452
0
27 Apr 2023
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
582
4,945
0
17 Apr 2023
Synthetic Data from Diffusion Models Improves ImageNet Classification
Shekoofeh Azizi
Simon Kornblith
Chitwan Saharia
Mohammad Norouzi
David J. Fleet
VLM
DiffM
125
316
0
17 Apr 2023
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy Q. Vo
Marc Szafraniec
...
Hervé Jégou
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
VLM
CLIP
SSL
488
3,529
0
14 Apr 2023
Diffusion Models as Masked Autoencoders
Chen Wei
K. Mangalam
Po-Yao (Bernie) Huang
Yanghao Li
Haoqi Fan
Hu Xu
Huiyu Wang
Cihang Xie
Alan Yuille
Christoph Feichtenhofer
DiffM
SyDa
100
53
0
06 Apr 2023
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP
VLM
298
1,206
0
27 Mar 2023
MVImgNet: A Large-scale Dataset of Multi-view Images
Xianggang Yu
Mutian Xu
Yidan Zhang
Haolin Liu
Chongjie Ye
...
Zhangyang Xiong
Tianyou Liang
Guanying Chen
Shuguang Cui
Xiaoguang Han
3DV
138
171
0
10 Mar 2023
Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion
Jordan Shipard
Arnold Wiliem
Kien Nguyen Thanh
Wei Xiang
Clinton Fookes
DiffM
108
77
0
07 Feb 2023
Effective Data Augmentation With Diffusion Models
Brandon Trabucco
Kyle Doherty
Max Gurinas
Ruslan Salakhutdinov
DiffM
VLM
116
257
0
07 Feb 2023
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
Pascal Vincent
Michael G. Rabbat
Yann LeCun
Nicolas Ballas
SSL
AI4TS
MDE
143
361
0
19 Jan 2023
RxRx1: A Dataset for Evaluating Experimental Batch Correction Methods
Maciej Sypetkowski
Morteza Rezanejad
Saber Saberian
Oren Z. Kraus
John Urbanik
...
Mason L. Victors
J. Yosinski
A. R. Sereshkeh
I. Haque
Berton Earnshaw
93
38
0
13 Jan 2023
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks
Aleksander Holynski
Alexei A. Efros
DiffM
215
1,841
0
17 Nov 2022
Is synthetic data from generative models ready for image recognition?
Ruifei He
Shuyang Sun
Xin Yu
Chuhui Xue
Wenqing Zhang
Philip Torr
Song Bai
Xiaojuan Qi
132
302
0
14 Oct 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
147
429
0
07 Apr 2022
SLIP: Self-supervision meets Language-Image Pre-training
Norman Mu
Alexander Kirillov
David Wagner
Saining Xie
VLM
CLIP
158
492
0
23 Dec 2021
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
561
7,866
0
11 Nov 2021
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
110
242
0
26 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
215
1,193
0
01 Apr 2021
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Jure Zbontar
Li Jing
Ishan Misra
Yann LeCun
Stéphane Deny
SSL
360
2,374
0
04 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
1.1K
30,053
0
26 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
519
3,911
0
11 Feb 2021
Dense Contrastive Learning for Self-Supervised Visual Pre-Training
Xinlong Wang
Rufeng Zhang
Chunhua Shen
Tao Kong
Lei Li
SSL
101
689
0
18 Nov 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
744
41,796
0
22 Oct 2020
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Mathilde Caron
Ishan Misra
Julien Mairal
Priya Goyal
Piotr Bojanowski
Armand Joulin
OCL
SSL
342
4,109
0
17 Jun 2020
Bootstrap your own latent: A new approach to self-supervised Learning
Jean-Bastien Grill
Florian Strub
Florent Altché
Corentin Tallec
Pierre Harvey Richemond
...
M. G. Azar
Bilal Piot
Koray Kavukcuoglu
Rémi Munos
Michal Valko
SSL
468
6,864
0
13 Jun 2020
Are we done with ImageNet?
Lucas Beyer
Olivier J. Hénaff
Alexander Kolesnikov
Xiaohua Zhai
Aaron van den Oord
VLM
139
408
0
12 Jun 2020
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen
Haoqi Fan
Ross B. Girshick
Kaiming He
SSL
567
3,450
0
09 Mar 2020
A Simple Framework for Contrastive Learning of Visual Representations
Ting-Li Chen
Simon Kornblith
Mohammad Norouzi
Geoffrey E. Hinton
SSL
431
18,968
0
13 Feb 2020
Momentum Contrast for Unsupervised Visual Representation Learning
Kaiming He
Haoqi Fan
Yuxin Wu
Saining Xie
Ross B. Girshick
SSL
272
12,164
0
13 Nov 2019
Deep Clustering for Unsupervised Learning of Visual Features
Mathilde Caron
Piotr Bojanowski
Armand Joulin
Matthijs Douze
SSL
179
1,906
0
15 Jul 2018
Functional Map of the World
Gordon A. Christie
Neil Fendley
James Wilson
R. Mukherjee
VGen
87
399
0
21 Nov 2017
FOIL it! Find One mismatch between Image and Language caption
Ravi Shekhar
Sandro Pezzelle
Yauhen Klimovich
Aurélie Herbelot
Moin Nabi
E. Sangineto
Raffaella Bernardi
65
141
0
03 May 2017
1
2
Next