MOVi: Training-free Text-conditioned Multi-Object Video Generation

29 May 2025
Aimon Rahman, Jiang Liu, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Yusheng Su, Vishal M. Patel, Zicheng Liu, Emad Barsoum
Main: 12 pages, 9 figures, 7 tables; bibliography: 4 pages
Abstract

Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to accurately capture complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate the multiple distinct objects specified in the prompt, producing incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns and to prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, yielding a 42% absolute improvement in motion dynamics and object generation accuracy while maintaining high fidelity and motion smoothness.
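The pipeline the abstract describes, an LLM "director" that plans per-object trajectories, noise re-initialization that injects those trajectories into the initial latent, and attention masking that keeps each object's features separate, can be sketched roughly as below. This is a minimal illustrative sketch in numpy, not the paper's implementation: the trajectory stub stands in for a real LLM call, and all function names, shapes, and parameters are assumptions.

```python
import numpy as np

def llm_trajectory_stub(prompt, num_frames):
    """Stand-in for the LLM 'director': returns per-object center
    trajectories. (Hypothetical: the method queries an actual LLM.)"""
    # Two objects crossing a 64x64 frame in opposite directions.
    xs = np.linspace(8, 56, num_frames)
    return {
        "cat":  [(int(x), 20) for x in xs],
        "ball": [(int(64 - x), 44) for x in xs],
    }

def reinitialize_noise(noise, trajectories, patch=6, seed=0):
    """Noise re-initialization sketch: copy one shared noise patch along
    each object's trajectory, so the same latent content follows the
    planned path across frames."""
    rng = np.random.default_rng(seed)
    out = noise.copy()  # (frames, H, W, C)
    for obj, path in trajectories.items():
        shared = rng.standard_normal((patch, patch, noise.shape[-1]))
        for t, (cx, cy) in enumerate(path):
            y0, x0 = cy - patch // 2, cx - patch // 2
            out[t, y0:y0 + patch, x0:x0 + patch, :] = shared
    return out

def masked_cross_attention(scores, token_obj_ids, region_obj_ids):
    """Attention-manipulation sketch: mask cross-object entries so each
    spatial region attends only to its own object's text tokens,
    preventing feature mixing between objects."""
    mask = region_obj_ids[:, None] == token_obj_ids[None, :]
    masked = np.where(mask, scores, -1e9)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```

For example, after `reinitialize_noise` the patch at the "cat" center in frame 0 is identical to the patch at its (shifted) center in frame 3, and `masked_cross_attention` assigns near-zero weight from a region to tokens belonging to the other object.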

@article{rahman2025_2505.22980,
  title={MOVi: Training-free Text-conditioned Multi-Object Video Generation},
  author={Aimon Rahman and Jiang Liu and Ze Wang and Ximeng Sun and Jialian Wu and Xiaodong Yu and Yusheng Su and Vishal M. Patel and Zicheng Liu and Emad Barsoum},
  journal={arXiv preprint arXiv:2505.22980},
  year={2025}
}