HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

16 June 2025

Main:9 Pages

4 Figures

Bibliography:3 Pages

3 Tables

Abstract

Semi-supervised semantic segmentation remains challenging under severe label scarcity and domain variability. Vision-only methods often struggle to generalize, resulting in pixel misclassification between similar classes, poor generalization and boundary localization. Vision-Language Models offer robust, domain-invariant semantics but lack the spatial grounding required for dense prediction. We introduce HierVL, a unified framework that bridges this gap by integrating abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation. HierVL features three novel components: a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings into multi-scale queries to suppress irrelevant classes and handle intra-class variability; a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features for sharper boundaries under sparse supervision; and a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic grounding. HierVL establishes a new state-of-the-art by achieving a +4.4% mean improvement of the intersection over the union on COCO (with 232 labeled images), +3.1% on Pascal VOC (with 92 labels), +5.9% on ADE20 (with 158 labels) and +1.8% on Cityscapes (with 100 labels), demonstrating better performance under 1% supervision on four benchmark datasets. Our results show that language-guided segmentation closes the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization.

View on arXiv

@article{nadeem2025_2506.13925,
  title={ HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment },
  author={ Numair Nadeem and Saeed Anwar and Muhammad Hamza Asad and Abdul Bais },
  journal={arXiv preprint arXiv:2506.13925},
  year={ 2025 }
}

Comments on this paper