Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models

Abstract

Recent research has highlighted the risk of generative model collapse, in which performance progressively degrades when models are continually trained on their own generated data. However, existing studies of model collapse have been confined to single, unimodal models, which limits our understanding of more realistic scenarios, such as diverse multi-modal AI agents interacting autonomously through synthetic data and evolving continually. We extend the study of synthetic-data training and model collapse to multi-modal vision-language generative systems, including vision-language models (VLMs) and text-to-image diffusion models, as well as to recursive generate-train loops involving multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal setting, such as improved vision-language alignment and increased variance on the VLM image-captioning task. We also find that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate collapse. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems and for curating robust multi-modal synthetic datasets.
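
As a rough illustration of the recursive generate-train loop the abstract describes, the Python sketch below alternates captioning, image regeneration, and fine-tuning across generations. This is a minimal sketch, not the authors' code: the stub classes and all method names (caption, generate, finetune) are hypothetical placeholders standing in for real VLM and diffusion-model training code, and the num_candidates argument hedges at the "increased decoding budget" mitigation mentioned in the abstract.

# Minimal sketch of a recursive generate-train loop between a VLM and a
# text-to-image (T2I) diffusion model. All classes and methods here are
# hypothetical placeholders, not the paper's implementation.

class StubVLM:
    def caption(self, image, num_candidates=1):
        # A real VLM would sample `num_candidates` captions and keep the
        # best-aligned one (one form of increased decoding budget).
        return f"caption-of-{image}"

    def finetune(self, image_caption_pairs):
        pass  # placeholder for gradient updates on synthetic (image, caption) pairs

class StubT2I:
    def generate(self, caption):
        # A real diffusion model would synthesize an image from the caption.
        return f"image-from-{caption}"

    def finetune(self, caption_image_pairs):
        pass  # placeholder for gradient updates on synthetic (caption, image) pairs

def recursive_generate_train(vlm, t2i, real_images, generations=5, decode_budget=1):
    """Alternate caption -> image -> fine-tune for `generations` rounds,
    so each model is trained on the other's synthetic outputs."""
    images = list(real_images)
    for _ in range(generations):
        captions = [vlm.caption(img, num_candidates=decode_budget) for img in images]
        images = [t2i.generate(cap) for cap in captions]
        vlm.finetune(list(zip(images, captions)))
        t2i.finetune(list(zip(captions, images)))
    return vlm, t2i

if __name__ == "__main__":
    recursive_generate_train(StubVLM(), StubT2I(), ["img0", "img1"])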

@article{hu2025_2505.08803,
  title={Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models},
  author={Zizhao Hu and Mohammad Rostami and Jesse Thomason},
  journal={arXiv preprint arXiv:2505.08803},
  year={2025}
}