FOIL it! Find One mismatch between Image and Language caption

3 May 2017

Papers citing "FOIL it! Find One mismatch between Image and Language caption"

28 / 28 papers shown

Title
TULIP: Towards Unified Language-Image Pretraining Zineng Tang Long Lian Seun Eisape Xudong Wang Roei Herzig Adam Yala Alane Suhr Trevor Darrell David M. Chan VLM CLIP MLLM 103 3 0 19 Mar 2025
MASS: Overcoming Language Bias in Image-Text Matching Jiwan Chung Seungwon Lim Sangkyu Lee Youngjae Yu VLM 32 0 0 20 Jan 2025
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation Michal Golovanevsky William Rudman Vedant Palit Ritambhara Singh Carsten Eickhoff 33 1 0 24 Jun 2024
Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models A. Bavaresco A. Testoni Raquel Fernández 36 2 0 31 May 2024
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? Letitia Parcalabescu Anette Frank MLLM CoGe VLM 84 3 0 29 Apr 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning Yuiga Wada Kanta Kaneda Daichi Saito Komei Sugiura 34 24 0 28 Feb 2024
Data Augmentations for Improved (Large) Language Model Generalization Amir Feder Yoav Wald Claudia Shi S. Saria David M. Blei OOD CML 32 7 0 19 Oct 2023
Scalable Performance Analysis for Vision-Language Models Santiago Castro Oana Ignat Rada Mihalcea VLM 35 1 0 30 May 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics Saba Ahmadi Aishwarya Agrawal 30 6 0 24 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining Emanuele Bugliarello Aida Nematzadeh Lisa Anne Hendricks SSL 30 5 0 23 May 2023
Mutual Information Divergence: A Unified Metric for Multimodal Generative Models Jin-Hwa Kim Yunji Kim Jiyoung Lee Kang Min Yoo Sang-Woo Lee EGVM 36 32 0 25 May 2022
Can Visual Dialogue Models Do Scorekeeping? Exploring How Dialogue Representations Incrementally Encode Shared Knowledge Brielen Madureira David Schlangen 25 4 0 14 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality Tristan Thrush Ryan Jiang Max Bartolo Amanpreet Singh Adina Williams Douwe Kiela Candace Ross CoGe 34 401 0 07 Apr 2022
CLIP-Event: Connecting Text and Images with Event Structures Manling Li Ruochen Xu Shuohang Wang Luowei Zhou Xudong Lin Chenguang Zhu Michael Zeng Heng Ji Shih-Fu Chang VLM CLIP 27 123 0 13 Jan 2022
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching Yaya Shi Xu Yang Haiyang Xu Chunfen Yuan Bing Li Weiming Hu Zhengjun Zha 39 33 0 17 Nov 2021
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond Amir Feder Katherine A. Keith Emaad A. Manzoor Reid Pryzant Dhanya Sridhar ... Roi Reichart Margaret E. Roberts Brandon M Stewart Victor Veitch Diyi Yang CML 41 234 0 02 Sep 2021
QACE: Asking Questions to Evaluate an Image Caption Hwanhee Lee Thomas Scialom Seunghyun Yoon Franck Dernoncourt Kyomin Jung CoGe 25 18 0 28 Aug 2021
Probing Image-Language Transformers for Verb Understanding Lisa Anne Hendricks Aida Nematzadeh 30 114 0 16 Jun 2021
CLIPScore: A Reference-free Evaluation Metric for Image Captioning Jack Hessel Ari Holtzman Maxwell Forbes Ronan Le Bras Yejin Choi CLIP 17 1,442 0 18 Apr 2021
Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case Adam Dahlgren Lindström Suna Bensch Johanna Björklund F. Drewes 24 20 0 22 Feb 2021
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks Letitia Parcalabescu Albert Gatt Anette Frank Iacer Calixto LRM 33 48 0 22 Dec 2020
Evaluating Models' Local Decision Boundaries via Contrast Sets Matt Gardner Yoav Artzi Victoria Basmova Jonathan Berant Ben Bogin ... Sanjay Subramanian Reut Tsarfaty Eric Wallace Ally Zhang Ben Zhou ELM 43 84 0 06 Apr 2020
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks Jiasen Lu Dhruv Batra Devi Parikh Stefan Lee SSL VLM 70 3,624 0 06 Aug 2019
Evaluating the Representational Hub of Language and Vision Models Ravi Shekhar Ece Takmaz Raquel Fernández Raffaella Bernardi 30 11 0 12 Apr 2019
Pre-gen metrics: Predicting caption quality metrics without generating captions Marc Tanti Albert Gatt K. Camilleri 26 2 0 12 Oct 2018
How clever is the FiLM model, and how clever can it be? A. Kuhnle Huiyuan Xie Ann A. Copestake 22 6 0 09 Sep 2018
Targeted Syntactic Evaluation of Language Models Rebecca Marvin Tal Linzen 31 406 0 27 Aug 2018
Defoiling Foiled Image Captions Pranava Madhyastha Josiah Wang Lucia Specia 27 9 0 16 May 2018