Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

3 March 2025
Haoxin Li
Boyang Li
    CoGe

Papers citing "Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data"

50 / 103 papers shown
Title
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe, CLIP
121
14
0
04 Nov 2024
Natural Language Inference Improves Compositionality in Vision-Language Models
Paola Cascante-Bonilla
Yu Hou
Yang Trista Cao
Hal Daumé III
Rachel Rudinger
ReLM, CoGe, VLM
76
4
0
29 Oct 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua
Yunlong Tang
Ziyun Zeng
Liangliang Cao
Zhengyuan Yang
Hangfeng He
Chenliang Xu
Jiebo Luo
VLM, CoGe
70
13
0
13 Oct 2024
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Youngtaek Oh
Jae-Won Cho
Dong-Jin Kim
In So Kweon
Junmo Kim
VLM, CoGe, CLIP
96
6
0
07 Oct 2024
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
CoGe
80
12
0
17 Jun 2024
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
VLM
97
5
0
16 May 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLM, CoGe
79
12
0
02 Apr 2024
Uncovering the Text Embedding in Text-to-Image Diffusion Models
Huikang Yu
Hao Luo
Fan Wang
Feng Zhao
54
10
0
01 Apr 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLM, CoGe
68
20
0
29 Mar 2024
ReNoise: Real Image Inversion Through Iterative Noising
Daniel Garibi
Or Patashnik
Andrey Voynov
Hadar Averbuch-Elor
Daniel Cohen-Or
DiffM
96
57
0
21 Mar 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser
Sumith Kulal
A. Blattmann
Rahim Entezari
Jonas Muller
...
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
DiffM
288
1,388
0
05 Mar 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM, CoGe, VLM
72
6
0
22 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
79
23
0
20 Feb 2024
Improving fine-grained understanding in image-text pre-training
Ioana Bica
Anastasija Ilić
Matthias Bauer
Goker Erdogan
Matko Bošnjak
...
A. Gritsenko
Matthias Minderer
Charles Blundell
Razvan Pascanu
Jovana Mitrović
VLM
53
27
0
18 Jan 2024
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
S. DarshanSingh
Zeeshan Khan
Makarand Tapaswi
VLM, CLIP
63
3
0
15 Jan 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM, MLLM
105
347
0
11 Jan 2024
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
82
8
0
19 Dec 2023
Style Aligned Image Generation via Shared Attention
Amir Hertz
Andrey Voynov
Shlomi Fruchter
Daniel Cohen-Or
DiffM
55
135
0
04 Dec 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM, CoGe, MLLM
72
24
0
30 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM, LRM
84
98
0
27 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM, CoGe
70
13
0
07 Nov 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM, CoGe
63
117
0
30 Oct 2023
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo
Yiqin Tan
Longbo Huang
Jian Li
Hang Zhao
DiffM
92
476
0
06 Oct 2023
EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods
Samyadeep Basu
Mehrdad Saberi
S. Bhardwaj
Atoosa Malemir Chegini
Daniela Massiceti
Maziar Sanjabi
S. Hu
Soheil Feizi
86
22
0
03 Oct 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Fei Wang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
69
15
0
24 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH, ALM
396
12,044
0
18 Jul 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh
Jieyu Zhang
Zixian Ma
Aniruddha Kembhavi
Ranjay Krishna
CoGe
110
132
0
26 Jun 2023
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Yujun Shi
Chuhui Xue
Jun Hao Liew
Jiachun Pan
Hanshu Yan
Wenqing Zhang
Vincent Y. F. Tan
Song Bai
106
220
0
26 Jun 2023
DesCo: Learning Object Recognition with Rich Language Descriptions
Liunian Harold Li
Zi-Yi Dou
Nanyun Peng
Kai-Wei Chang
ObjD, VLM
64
22
0
24 Jun 2023
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Yonglong Tian
Lijie Fan
Phillip Isola
Huiwen Chang
Dilip Krishnan
VLM, DiffM
107
152
0
01 Jun 2023
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Roei Herzig
Donghyun Kim
...
Yikang Shen
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM, CoGe
103
58
0
31 May 2023
Learning to Imagine: Visually-Augmented Natural Language Generation
Tianyi Tang
Yushuo Chen
Yifan Du
Junyi Li
Wayne Xin Zhao
Ji-Rong Wen
DiffM
64
9
0
26 May 2023
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Harman Singh
Pengchuan Zhang
Qifan Wang
Mengjiao MJ Wang
Wenhan Xiong
Jingfei Du
Yu Chen
CoGe, VLM
76
26
0
23 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig
Alon Mendelson
Leonid Karlinsky
Assaf Arbelle
Rogerio Feris
Trevor Darrell
Amir Globerson
VLM
83
33
0
10 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
92
25
0
06 May 2023
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLM, MLLM
165
2,064
0
20 Apr 2023
Synthetic Data from Diffusion Models Improves ImageNet Classification
Shekoofeh Azizi
Simon Kornblith
Chitwan Saharia
Mohammad Norouzi
David J. Fleet
VLM, DiffM
103
315
0
17 Apr 2023
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Paola Cascante-Bonilla
Khaled Shehada
James Smith
Sivan Doveh
Donghyun Kim
...
Gül Varol
A. Oliva
Vicente Ordonez
Rogerio Feris
Leonid Karlinsky
VLM, SyDa
90
42
0
30 Mar 2023
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP, VLM
248
1,200
0
27 Mar 2023
Twin Contrastive Learning with Noisy Labels
Zhizhong Huang
Junping Zhang
Hongming Shan
NoLa
48
58
0
13 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM
429
4,642
0
30 Jan 2023
Muse: Text-To-Image Generation via Masked Generative Transformers
Huiwen Chang
Han Zhang
Jarred Barber
AJ Maschinot
José Lezama
...
Kevin Patrick Murphy
William T. Freeman
Michael Rubinstein
Yuanzhen Li
Dilip Krishnan
DiffM
269
556
0
02 Jan 2023
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang
Yeganeh Kordi
Swaroop Mishra
Alisa Liu
Noah A. Smith
Daniel Khashabi
Hannaneh Hajishirzi
ALM, SyDa, LRM
144
2,247
0
20 Dec 2022
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Or Honovich
Thomas Scialom
Omer Levy
Timo Schick
ALM
126
375
0
19 Dec 2022
Fake it till you make it: Learning transferable representations from synthetic ImageNet clones
Mert Bulent Sariyildiz
Alahari Karteek
Diane Larlus
Yannis Kalantidis
DiffM, VLM
91
160
0
16 Dec 2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Zixian Ma
Jerry Hong
Mustafa Omer Gul
Mona Gandhi
Irena Gao
Ranjay Krishna
CoGe
79
141
0
13 Dec 2022
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
Zhixing Zhang
Ligong Han
Arna Ghosh
Dimitris N. Metaxas
Jian Ren
DiffM
141
160
0
08 Dec 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Yikang Shen
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM, CoGe
98
72
0
21 Nov 2022
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks
Aleksander Holynski
Alexei A. Efros
DiffM
209
1,830
0
17 Nov 2022
Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning
Yu Meng
Martin Michalski
Jiaxin Huang
Yu Zhang
Tarek Abdelzaher
Jiawei Han
VLM
118
49
0
06 Nov 2022