Fast-SAM3D: 3Dfy Anything in Images but Faster

Weilun Feng
Mingqiang Wu
Zhiliang Chen
Chuanguang Yang
Haotong Qin
Yuqi Li
Xiaokun Liu
Guoxin Fan
Zhulin An
Libo Huang
Yulun Zhang
Michele Magno
Yongjun Xu
Main: 8 pages, 12 figures, 10 tables; bibliography: 3 pages; appendix: 7 pages
Abstract

SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We show that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching, which decouples structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving, which concentrates refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation, which adapts decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to a 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at this https URL.
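To make the "Modality-Aware Step Caching" idea concrete, here is a minimal toy sketch (not the paper's implementation; all names, the update rule, and the caching interval are illustrative assumptions). It shows the general training-free caching pattern the abstract describes: a slowly evolving "shape" branch reuses a cached network output across several denoising steps, while the drift-sensitive "layout" branch is recomputed at every step.

```python
import numpy as np

def generate(x, steps, shape_fn, layout_fn, shape_interval=3):
    """Toy step-caching loop (illustrative only, not Fast-SAM3D itself).

    shape_fn / layout_fn stand in for the two branches of a denoiser:
    the shape branch is assumed to change smoothly, so its output is
    cached and reused for `shape_interval` steps; the layout branch is
    treated as cache-sensitive and evaluated every step.
    """
    shape_cache = None
    shape_evals = layout_evals = 0
    for t in range(steps):
        # Recompute the cached branch only on schedule.
        if shape_cache is None or t % shape_interval == 0:
            shape_cache = shape_fn(x, t)
            shape_evals += 1
        layout = layout_fn(x, t)  # always fresh
        layout_evals += 1
        x = x + 0.1 * (shape_cache + layout)  # toy update rule
    return x, shape_evals, layout_evals
```

With 9 steps and `shape_interval=3`, the shape branch runs only 3 times while the layout branch runs all 9, which is the source of the speedup: the expensive branch is evaluated on a coarser schedule without altering the per-step update structure.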
