Moondream Segmentation: From Words to Masks

3 April 2026

Ethan Reid

ObjD

ISeg

VLM

Main: 8 pages; bibliography: 2 pages; appendix: 6 pages; 12 figures; 3 tables

Abstract

We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val).
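The two reported metrics aggregate intersection-over-union differently, which matters when comparing the numbers. The sketch below is illustrative only: it assumes the definitions commonly used in referring segmentation work (cIoU as total intersection over total union across all samples, mIoU as the mean of per-sample IoU), and the paper's exact evaluation protocol may differ. The function names and the convention of scoring an empty union as 1.0 are assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def miou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Mean of per-sample IoU: every sample counts equally, whatever its size."""
    ious = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: two empty masks agree perfectly.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Cumulative IoU: summed intersections over summed unions, so large objects dominate."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# One small object predicted half right, one large object predicted exactly.
small = ([[1, 0]], [[1, 1]])          # IoU = 1 / 2
large = ([[1] * 8], [[1] * 8])        # IoU = 8 / 8
pairs = [small, large]

print(f"mIoU = {miou(pairs):.2f}")    # (0.5 + 1.0) / 2 = 0.75
print(f"cIoU = {ciou(pairs):.2f}")    # (1 + 8) / (2 + 8) = 0.90
```

In this toy case the error on the small object pulls mIoU down much more than cIoU, so scores under the two metrics are not directly comparable.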
