
Bolt3D: Generating 3D Scenes in Seconds

Abstract

We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent, high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300.

@article{szymanowicz2025_2503.14445,
  title={Bolt3D: Generating 3D Scenes in Seconds},
  author={Stanislaw Szymanowicz and Jason Y. Zhang and Pratul Srinivasan and Ruiqi Gao and Arthur Brussee and Aleksander Holynski and Ricardo Martin-Brualla and Jonathan T. Barron and Philipp Henzler},
  journal={arXiv preprint arXiv:2503.14445},
  year={2025}
}