Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets
Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, the category sets of different modalities are often inconsistent, which can hinder a model's ability to use cross-modal information to recognize all categories. In this work, we propose a practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), in which models are trained on multi-modal data with heterogeneous category sets and aim to recognize the complete class set of all modalities at test time. To address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF first aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then uses uncertainty estimation to select the most discriminative modality for decision fusion. Finally, it integrates cross-modal information based on class similarity, with the auxiliary modality refining the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.
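The decision-fusion steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the uncertainty measure, the similarity function, or the fusion rule, so the sketch assumes Shannon entropy as the uncertainty estimate, cosine similarity between hypothetical class embeddings in the shared semantic space, and a simple weighted sum controlled by a hypothetical parameter `alpha`. All function names, class names, and numbers are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spread_by_similarity(pred, class_embeddings):
    """Project a prediction over a partial category set onto every class,
    weighting each predicted class by its similarity to the target class."""
    scores = {
        c: sum(p * max(cosine(class_embeddings[c], class_embeddings[k]), 0.0)
               for k, p in pred.items())
        for c in class_embeddings
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

def fuse(pred_a, pred_b, class_embeddings, alpha=0.3):
    """Pick the lower-entropy modality as dominant, then let the auxiliary
    modality refine it through class similarity (assumed fusion rule)."""
    if entropy(pred_a.values()) <= entropy(pred_b.values()):
        dominant, auxiliary = pred_a, pred_b
    else:
        dominant, auxiliary = pred_b, pred_a
    aux_full = spread_by_similarity(auxiliary, class_embeddings)
    # Classes outside the dominant modality's category set start at zero.
    return {c: (1 - alpha) * dominant.get(c, 0.0) + alpha * aux_full[c]
            for c in class_embeddings}

# Hypothetical class embeddings in a shared semantic space.
class_embeddings = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
# Each modality only covers part of the full class set.
pred_image = {"cat": 0.9, "car": 0.1}    # image branch never saw "dog"
pred_audio = {"dog": 0.55, "car": 0.45}  # audio branch never saw "cat"

fused = fuse(pred_image, pred_audio, class_embeddings)
prediction = max(fused, key=fused.get)
```

In this toy example the image branch has lower entropy and is treated as dominant, while the audio branch contributes probability mass to "dog", a class outside the image branch's category set. A learned model would replace each of these hand-picked components.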
@article{zhu2025_2506.09745,
  title={Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets},
  author={Yangrui Zhu and Junhua Bao and Yipan Wei and Yapeng Li and Bo Du},
  journal={arXiv preprint arXiv:2506.09745},
  year={2025}
}