ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

13 June 2025
Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal
Main: 8 pages · Appendix: 5 pages · Bibliography: 2 pages · 10 figures · 2 tables
Abstract

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past text-image pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose ViSTA, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module that extracts relevant history features and (2) a history adapter that conditions generation on those extracted features. We also introduce a salient history selection strategy at inference time, which selects the most salient history text-image pair and thereby improves the quality of the conditioning. Furthermore, we employ a Visual Question Answering-based metric, TIFA, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, ViSTA generates image sequences that are not only consistent across frames but also well aligned with the narrative text descriptions.
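
The abstract describes ViSTA's components only at a high level. As a rough, non-authoritative sketch in PyTorch, the code below shows one plausible reading of the three ideas: picking the most salient history pair by embedding similarity, fusing that pair's text and image features, and injecting the fused features into the diffusion backbone through an adapter's cross-attention. Every class name, the pooled CLIP-style embeddings, and the cosine-similarity selection rule are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HistoryFusion(nn.Module):
    """Hypothetical fusion module: merges one history pair's text and
    image token features into a single conditioning sequence."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj_text = nn.Linear(dim, dim)
        self.proj_image = nn.Linear(dim, dim)

    def forward(self, hist_text, hist_image):
        # (B, L_t, D) and (B, L_i, D) -> (B, L_t + L_i, D)
        return torch.cat([self.proj_text(hist_text),
                          self.proj_image(hist_image)], dim=1)

class HistoryAdapter(nn.Module):
    """Hypothetical adapter: residual cross-attention that lets backbone
    features attend to the fused history features."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, backbone_feats, hist_feats):
        attended, _ = self.attn(backbone_feats, hist_feats, hist_feats)
        return backbone_feats + attended  # residual conditioning

def select_salient_pair(current_text_emb, hist_text_embs):
    """Assumed selection rule: return the index of the history pair whose
    pooled text embedding is most similar to the current prompt's."""
    sims = F.cosine_similarity(current_text_emb.unsqueeze(0),
                               hist_text_embs, dim=-1)
    return int(sims.argmax())

# Toy usage with random stand-ins for real encoder outputs.
dim = 768
fusion, adapter = HistoryFusion(dim), HistoryAdapter(dim)
idx = select_salient_pair(torch.randn(dim), torch.randn(4, dim))
fused = fusion(torch.randn(1, 77, dim), torch.randn(1, 257, dim))
conditioned = adapter(torch.randn(1, 64, dim), fused)  # (1, 64, 768)

In a full pipeline, the adapter would presumably be applied inside each denoising step of the U-Net; the paper's actual fusion and selection mechanisms may differ in all of these details.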

View on arXiv: https://arxiv.org/abs/2506.12198
@article{dong2025_2506.12198,
  title={ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models},
  author={Sibo Dong and Ismail Shaheen and Maggie Shen and Rupayan Mallick and Sarah Adel Bargal},
  journal={arXiv preprint arXiv:2506.12198},
  year={2025}
}