Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024 · 14 October 2024
Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto
Tags: VLM

Papers citing "Locality Alignment Improves Vision-Language Models"

50 / 123 papers shown
LAION-5B: An open large-scale dataset for training next generation image-text models
Neural Information Processing Systems (NeurIPS), 2022 · 16 Oct 2022
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, ..., Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, R. Kaczmarczyk, J. Jitsev
Tags: VLM, MLLM, CLIP
508 · 4,383 · 0

Vision Transformers provably learn spatial structure
Neural Information Processing Systems (NeurIPS), 2022 · 13 Oct 2022
Samy Jelassi, Michael E. Sander, Yuan-Fang Li
Tags: ViT, MLT
167 · 99 · 0

When and why vision-language models behave like bags-of-words, and what to do about it?
International Conference on Learning Representations (ICLR), 2022 · 04 Oct 2022
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
Tags: VLM, CoGe
291 · 507 · 0

Improving Self-Supervised Learning by Characterizing Idealized Representations
Neural Information Processing Systems (NeurIPS), 2022 · 13 Sep 2022
Yann Dubois, Tatsunori Hashimoto, Stefano Ermon, Abigail Z. Jacobs
Tags: SSL
253 · 45 · 0

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022 · 25 Aug 2022
Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, ..., Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu
Tags: CLIP, VLM
205 · 215 · 0

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
12 Aug 2022
Zhiliang Peng, Li Dong, Hangbo Bao, QiXiang Ye, Furu Wei
281 · 383 · 0

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022 · 17 Jun 2022
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
Tags: ObjD, VLM, MLLM
305 · 464 · 0

A Unified Sequence Interface for Vision Tasks
Neural Information Processing Systems (NeurIPS), 2022 · 15 Jun 2022
Ting-Li Chen, Saurabh Saxena, Lala Li, Nayeon Lee, David J. Fleet, Geoffrey E. Hinton
Tags: VLM, MLLM
158 · 167 · 0

Learning to Estimate Shapley Values with Vision Transformers
International Conference on Learning Representations (ICLR), 2022 · 10 Jun 2022
Ian Covert, Chanwoo Kim, Su-In Lee
Tags: FAtt
199 · 51 · 0

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
European Conference on Computer Vision (ECCV), 2022 · 03 Jun 2022
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi
261 · 722 · 0

Vision Transformer Adapter for Dense Predictions
International Conference on Learning Representations (ICLR), 2022 · 17 May 2022
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
651 · 730 · 0

CoCa: Contrastive Captioners are Image-Text Foundation Models
04 May 2022
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
Tags: VLM, CLIP, OffRL
495 · 1,563 · 0

Visual Spatial Reasoning
Transactions of the Association for Computational Linguistics (TACL), 2022 · 30 Apr 2022
Fangyu Liu, Guy Edward Toh Emerson, Nigel Collier
Tags: ReLM
340 · 246 · 0

Flamingo: a Visual Language Model for Few-Shot Learning
Neural Information Processing Systems (NeurIPS), 2022 · 29 Apr 2022
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, ..., Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Tags: MLLM, VLM
638 · 4,611 · 0

Missingness Bias in Model Debugging
International Conference on Learning Representations (ICLR), 2022 · 19 Apr 2022
Saachi Jain, Hadi Salman, E. Wong, Pengchuan Zhang, Vibhav Vineet, Sai H. Vemprala, Aleksander Madry
196 · 42 · 0

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Computer Vision and Pattern Recognition (CVPR), 2022 · 07 Apr 2022
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross
Tags: CoGe
294 · 501 · 0

Exploring Plain Vision Transformer Backbones for Object Detection
European Conference on Computer Vision (ECCV), 2022 · 30 Mar 2022
Yanghao Li, Hanzi Mao, Ross B. Girshick, Kaiming He
Tags: ViT
483 · 1,000 · 0

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
International Conference on Machine Learning (ICML), 2022 · 07 Feb 2022
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
Tags: SSL, VLM, ViT
376 · 1,007 · 0

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
International Conference on Machine Learning (ICML), 2022 · 07 Feb 2022
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang
Tags: MLLM, ObjD
406 · 984 · 0

Context Autoencoder for Self-Supervised Representation Learning
International Journal of Computer Vision (IJCV), 2022 · 07 Feb 2022
Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang
Tags: SSL
378 · 438 · 0

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022 · 28 Jan 2022
Junnan Li, Dongxu Li, Caiming Xiong, Guosheng Lin
Tags: MLLM, BDL, VLM, CLIP
1.2K · 5,514 · 0

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
European Conference on Computer Vision (ECCV), 2021 · 29 Dec 2021
Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, Xiang Bai
Tags: VLM
279 · 276 · 0

SLIP: Self-supervision meets Language-Image Pre-training
European Conference on Computer Vision (ECCV), 2021 · 23 Dec 2021
Norman Mu, Alexander Kirillov, David Wagner, Saining Xie
Tags: VLM, CLIP
316 · 557 · 0

Masked Feature Prediction for Self-Supervised Visual Pre-Training
16 Dec 2021
Chen Wei, Haoqi Fan, Saining Xie, Chaoxia Wu, Alan Yuille, Christoph Feichtenhofer
Tags: ViT
409 · 767 · 0

RegionCLIP: Region-based Language-Image Pretraining
16 Dec 2021
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, ..., Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
Tags: VLM, CLIP
287 · 735 · 0

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
02 Dec 2021
Yanghao Li, Chaoxia Wu, Haoqi Fan, K. Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
Tags: ViT
419 · 815 · 0

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
02 Dec 2021
Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
Tags: VLM, CLIP
388 · 699 · 0

iBOT: Image BERT Pre-Training with Online Tokenizer
15 Nov 2021
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
281 · 895 · 0

Masked Autoencoders Are Scalable Vision Learners
Computer Vision and Pattern Recognition (CVPR), 2021 · 11 Nov 2021
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross B. Girshick
Tags: ViT, TPM
1.5K · 9,694 · 0

Do Vision Transformers See Like Convolutional Neural Networks?
19 Aug 2021
M. Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
Tags: ViT
326 · 1,171 · 0

FastSHAP: Real-Time Shapley Value Estimation
International Conference on Learning Representations (ICLR), 2021 · 15 Jul 2021
N. Jethani, Mukund Sudarshan, Ian Covert, Su-In Lee, Rajesh Ranganath
Tags: TDI, FAtt
315 · 161 · 0

Early Convolutions Help Transformers See Better
Neural Information Processing Systems (NeurIPS), 2021 · 28 Jun 2021
Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross B. Girshick
298 · 869 · 0

BEiT: BERT Pre-Training of Image Transformers
15 Jun 2021
Hangbo Bao, Li Dong, Songhao Piao, Furu Wei
Tags: ViT
716 · 3,294 · 0

Knowledge distillation: A good teacher is patient and consistent
Computer Vision and Pattern Recognition (CVPR), 2021 · 09 Jun 2021
Lucas Beyer, Xiaohua Zhai, Amelie Royer, L. Markeeva, Rohan Anil, Alexander Kolesnikov
Tags: VLM
286 · 347 · 0

Intriguing Properties of Vision Transformers
Neural Information Processing Systems (NeurIPS), 2021 · 21 May 2021
Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang
Tags: ViT
487 · 731 · 0

Emerging Properties in Self-Supervised Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021 · 29 Apr 2021
Mathilde Caron, Hugo Touvron, Ishan Misra, Edouard Grave, Julien Mairal, Piotr Bojanowski, Armand Joulin
1.8K · 7,593 · 0

ImageNet-21K Pretraining for the Masses
22 Apr 2021
T. Ridnik, Emanuel Ben-Baruch, Asaf Noy, Lihi Zelnik-Manor
Tags: SSeg, VLM, CLIP
625 · 831 · 0

All Tokens Matter: Token Labeling for Training Better Vision Transformers
Neural Information Processing Systems (NeurIPS), 2021 · 22 Apr 2021
Zihang Jiang, Qibin Hou, Li-xin Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng
Tags: ViT
321 · 233 · 0

An Empirical Study of Training Self-Supervised Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021 · 05 Apr 2021
Xinlei Chen, Saining Xie, Kaiming He
Tags: ViT
504 · 2,134 · 0

Towards General Purpose Vision Systems
Computer Vision and Pattern Recognition (CVPR), 2021 · 01 Apr 2021
Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem
210 · 55 · 0

Going deeper with Image Transformers
IEEE International Conference on Computer Vision (ICCV), 2021 · 31 Mar 2021
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Edouard Grave
Tags: ViT
440 · 1,162 · 0

CvT: Introducing Convolutions to Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021 · 29 Mar 2021
Haiping Wu, Bin Xiao, Noel Codella, Xiyang Dai, Lu Yuan, Lei Zhang
Tags: ViT
335 · 2,210 · 0

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
IEEE International Conference on Computer Vision (ICCV), 2021 · 25 Mar 2021
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, B. Guo
Tags: ViT
1.1K · 27,337 · 0

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
International Conference on Machine Learning (ICML), 2021 · 19 Mar 2021
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun
Tags: ViT
357 · 924 · 0

OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding
North American Chapter of the Association for Computational Linguistics (NAACL), 2021 · 13 Mar 2021
Ke-Jyun Wang, Yun-Hsuan Liu, Hung-Ting Su, Jen-Wei Wang, Yu-Siang Wang, Winston H. Hsu, Wen-Chin Chen
152 · 24 · 0

Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021 · 26 Feb 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Tags: CLIP, VLM
1.9K · 39,376 · 0

Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels
Computer Vision and Pattern Recognition (CVPR), 2021 · 13 Jan 2021
Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, Sanghyuk Chun
773 · 163 · 0

Explaining by Removing: A Unified Framework for Model Explanation
Journal of machine learning research (JMLR), 2020 · 21 Nov 2020
Ian Covert, Scott M. Lundberg, Su-In Lee
Tags: FAtt
301 · 293 · 0

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
22 Oct 2020
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby
Tags: ViT
1.2K · 53,038 · 0

Shapley explainability on the data manifold
International Conference on Learning Representations (ICLR), 2020 · 01 Jun 2020
Christopher Frye, Damien de Mijolla, T. Begley, Laurence Cowton, Megan Stanley, Ilya Feige
Tags: FAtt, TDI
361 · 113 · 0
