Do Vision and Language Encoders Represent the World Similarly?

10 January 2024

Mayug Maniparambil

Raiymbek Akshulakov

Y. A. D. Djilali

Sanath Narayan

Noel E. O'Connor

Papers citing "Do Vision and Language Encoders Represent the World Similarly?"

Title
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations Yile Wang Zhanyu Shen Hui Huang 26 0 0 15 May 2025
Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity B. Evens P. Latafat Panagiotis Patrinos 48 0 0 01 Apr 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models Jianing Qi Jiawei Liu Hao Tang Zhigang Zhu 104 1 0 21 Mar 2025
On the Internal Representations of Graph Metanetworks Taesun Yeom Jaeho Lee GNN 59 0 0 12 Mar 2025
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces Souhail Hadgi Luca Moschella Andrea Santilli Diego Gomez Qixing Huang Emanuele Rodolà Simone Melzi M. Ovsjanikov 40 0 0 07 Mar 2025
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities Zhaofeng Wu Xinyan Velocity Yu Dani Yogatama Jiasen Lu Yoon Kim AIFin 54 10 0 07 Nov 2024
Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models Orchid Chetia Phukan Sarthak Jain Swarup Ranjan Behera Arun Balaji Buduru Rajesh Sharma S. R Mahadeva Prasanna 28 0 0 21 Sep 2024
The Platonic Representation Hypothesis Minyoung Huh Brian Cheung Tongzhou Wang Phillip Isola 77 111 0 13 May 2024
Similarity of Neural Network Models: A Survey of Functional and Representational Measures Max Klabunde Tobias Schumacher M. Strohmaier Florian Lemmerich 52 64 0 10 May 2023
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training Antonio Norelli Marco Fumero Valentino Maiorca Luca Moschella Emanuele Rodolà Francesco Locatello VLM 81 33 0 04 Oct 2022
Linearly Mapping from Image to Text Space Jack Merullo Louis Castricato Carsten Eickhoff Ellie Pavlick VLM 164 104 0 30 Sep 2022
Emerging Properties in Self-Supervised Vision Transformers Mathilde Caron Hugo Touvron Ishan Misra Hervé Jégou Julien Mairal Piotr Bojanowski Armand Joulin 317 5,785 0 29 Apr 2021
The Power of Scale for Parameter-Efficient Prompt Tuning Brian Lester Rami Al-Rfou Noah Constant VPVLM 280 3,848 0 18 Apr 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts Soravit Changpinyo P. Sharma Nan Ding Radu Soricut VLM 278 1,082 0 17 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu H. Pham Quoc V. Le Yun-hsuan Sung Zhen Li Tom Duerig VLM CLIP 298 3,700 0 11 Feb 2021
Similarity Analysis of Contextual Word Representation Models John M. Wu Yonatan Belinkov Hassan Sajjad Nadir Durrani Fahim Dalvi James R. Glass 51 73 0 03 May 2020
Word Translation Without Parallel Data Alexis Conneau Guillaume Lample Marc'Aurelio Ranzato Ludovic Denoyer Hervé Jégou 174 1,635 0 11 Oct 2017