Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2204.03162
Cited By
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
7 April 2022
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality"
49 / 99 papers shown
Title
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
38
18
0
30 Nov 2023
What's left can't be right -- The remaining positional incompetence of contrastive vision-language models
Nils Hoehing
Ellen Rushe
Anthony Ventresque
VLM
28
3
0
20 Nov 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
VLM
CoGe
48
23
0
12 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
56
20
0
01 Oct 2023
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal
Haidi Azaman
Cheston Tan
56
3
0
13 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit
Rohan Pandey
Aryaman Arora
Paul Pu Liang
36
20
0
27 Aug 2023
An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma
Rong Li
Junwei Liang
CoGe
39
2
0
21 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
25
2
0
31 Jul 2023
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
S. Hall
F. G. Abrantes
Hanwen Zhu
Grace A. Sodunke
Aleksandar Shtedritski
Hannah Rose Kirk
CoGe
39
39
0
21 Jun 2023
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Rabiul Awal
Le Zhang
Aishwarya Agrawal
LRM
51
12
0
16 Jun 2023
Compositional diversity in visual concept learning
Yanli Zhou
Reuben Feinman
Brenden M. Lake
CoGe
OCL
34
8
0
30 May 2023
Scalable Performance Analysis for Vision-Language Models
Santiago Castro
Oana Ignat
Rada Mihalcea
VLM
40
1
0
30 May 2023
Modularized Zero-shot VQA with Pre-trained Models
Rui Cao
Jing Jiang
LRM
35
2
0
27 May 2023
Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer
Elinor Poole-Dayan
Vikram S. Voleti
Christopher Pal
Siva Reddy
45
13
0
25 May 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi
Aishwarya Agrawal
30
6
0
24 May 2023
Text encoders bottleneck compositionality in contrastive vision-language models
Amita Kamath
Jack Hessel
Kai-Wei Chang
CoGe
CLIP
VLM
35
19
0
24 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
34
5
0
23 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
59
74
0
17 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval
Arijit Ray
Filip Radenovic
Abhimanyu Dubey
Bryan A. Plummer
Ranjay Krishna
Kate Saenko
CoGe
VLM
47
35
0
05 May 2023
HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
Eslam Mohamed Bakr
Pengzhan Sun
Xiaoqian Shen
Faizan Farooq Khan
Li Erran Li
Mohamed Elhoseiny
VLM
32
76
0
11 Apr 2023
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
Qiucheng Wu
Yujian Liu
Handong Zhao
T. Bui
Zhe Lin
Yang Zhang
Shiyu Chang
DiffM
47
45
0
07 Apr 2023
Your Diffusion Model is Secretly a Zero-Shot Classifier
Alexander C. Li
Mihir Prabhudesai
Shivam Duggal
Ellis L Brown
Deepak Pathak
DiffM
VLM
60
227
0
28 Mar 2023
IRFL: Image Recognition of Figurative Language
Ron Yosef
Yonatan Bitton
Dafna Shahaf
48
18
0
27 Mar 2023
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Can Qin
Ning Yu
Chen Xing
Shu Zhen Zhang
Zeyuan Chen
Stefano Ermon
Yun Fu
Caiming Xiong
Ran Xu
DiffM
53
20
0
17 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
27
2
0
09 Mar 2023
Teaching CLIP to Count to Ten
Roni Paiss
Ariel Ephrat
Omer Tov
Shiran Zada
Inbar Mosseri
Michal Irani
Tali Dekel
VLM
CLIP
41
93
0
23 Feb 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
95
36
0
05 Jan 2023
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Rohan Pandey
Rulin Shao
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
34
12
0
20 Dec 2022
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Martha Lewis
Nihal V. Nayak
Peilin Yu
Qinan Yu
Jack Merullo
Stephen H. Bach
Ellie Pavlick
VLM
OCL
CoGe
36
59
0
20 Dec 2022
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Letitia Parcalabescu
Anette Frank
45
22
0
15 Dec 2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Zixian Ma
Jerry Hong
Mustafa Omer Gul
Mona Gandhi
Irena Gao
Ranjay Krishna
CoGe
39
125
0
13 Dec 2022
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Weixi Feng
Xuehai He
Tsu-Jui Fu
Varun Jampani
Arjun Reddy Akula
P. Narayana
Sugato Basu
Xinze Wang
William Yang Wang
CoGe
56
300
0
09 Dec 2022
Abstract Visual Reasoning with Tangram Shapes
Anya Ji
Noriyuki Kojima
N. Rush
Alane Suhr
Wai Keen Vong
Robert D. Hawkins
Yoav Artzi
LRM
17
34
0
29 Nov 2022
Deep representation learning: Fundamentals, Perspectives, Applications, and Open Challenges
K. T. Baghaei
Amirreza Payandeh
Pooya Fayyazsanavi
Shahram Rahimi
Zhiqian Chen
Somayeh Bakhtiari Ramezani
FaML
AI4TS
40
6
0
27 Nov 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Yikang Shen
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
61
71
0
21 Nov 2022
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
James Smith
Paola Cascante-Bonilla
Assaf Arbelle
Donghyun Kim
Yikang Shen
David D. Cox
Diyi Yang
Z. Kira
Rogerio Feris
Leonid Karlinsky
VLM
49
20
0
17 Nov 2022
Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
Daniel Fried
Nicholas Tomlin
Jennifer Hu
Roma Patel
Aida Nematzadeh
29
6
0
15 Nov 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan
Layne Berry
Eunsol Choi
David Harwath
Kyle Mahowald
CoGe
119
41
0
01 Nov 2022
Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
Wang Zhu
Jesse Thomason
Robin Jia
VLM
OOD
NAI
LRM
39
6
0
26 Oct 2022
FETA: Towards Specializing Foundation Models for Expert Task Applications
Amit Alfassy
Assaf Arbelle
Oshri Halimi
Sivan Harary
Roei Herzig
...
Christoph Auer
Kate Saenko
Peter W. J. Staar
Rogerio Feris
Leonid Karlinsky
28
19
0
08 Sep 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
43
132
0
18 Jun 2022
Contrastive language and vision learning of general fashion concepts
P. Chia
Giuseppe Attanasio
Federico Bianchi
Silvia Terragni
A. Magalhães
Diogo Gonçalves
C. Greco
Jacopo Tagliabue
CLIP
26
42
0
08 Apr 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
29
103
0
16 Jan 2022
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM
LRM
116
171
0
28 Sep 2021
COVR: A test-bed for Visually Grounded Compositional Generalization with real images
Ben Bogin
Shivanshu Gupta
Matt Gardner
Jonathan Berant
CoGe
39
29
0
22 Sep 2021
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate
Hannah Rose Kirk
B. Vidgen
Paul Röttger
Tristan Thrush
Scott A. Hale
67
57
0
12 Aug 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
213
310
0
02 Mar 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
323
1,086
0
17 Feb 2021
UnNatural Language Inference
Koustuv Sinha
Prasanna Parthasarathi
Joelle Pineau
Adina Williams
228
80
0
30 Dec 2020
Previous
1
2