
STRICT: Stress Test of Rendering Images Containing Text

Main: 7 pages · 6 figures · 1 table · Bibliography: 3 pages · Appendix: 3 pages
Abstract

While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits the models' ability to capture long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow text-rendering instructions. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at this https URL.
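The abstract's second and third dimensions, text correctness and instruction-following failures, can be made concrete with standard string metrics. The sketch below is a hypothetical illustration, not the paper's released pipeline: it computes a character error rate (CER) between the prompted text and the text an OCR system reads back from a generated image (the OCR step is abstracted as a plain string), and the fraction of samples whose CER exceeds a threshold. The function names and the 0.5 threshold are illustrative assumptions.

```python
# Hypothetical metrics sketch for a STRICT-style evaluation (assumptions:
# the OCR output is already available as a string; threshold is illustrative).

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: edit distance normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)

def violation_ratio(pairs, threshold: float = 0.5) -> float:
    """Fraction of (prompted, OCR-read) pairs whose CER exceeds threshold,
    a simple proxy for the instruction-following failure rate."""
    fails = sum(char_error_rate(ref, hyp) > threshold for ref, hyp in pairs)
    return fails / len(pairs)
```

For example, a sample whose prompted text comes back verbatim scores a CER of 0.0, while heavily garbled renderings push the sample over the threshold and into the violation count.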

@article{zhang2025_2505.18985,
  title={STRICT: Stress Test of Rendering Images Containing Text},
  author={Tianyu Zhang and Xinyu Wang and Zhenghan Tai and Lu Li and Jijun Chi and Jingrui Tian and Hailin He and Suyuchen Wang},
  journal={arXiv preprint arXiv:2505.18985},
  year={2025}
}
