Model alignment using inter-modal bridges

Main: 9 pages · 24 figures · 4 tables · Bibliography: 6 pages · Appendix: 18 pages
Abstract

Foundation models have demonstrated remarkable performance across modalities such as language and vision. However, model reuse across distinct modalities (e.g., text and vision) remains limited due to the difficulty of aligning internal representations. Existing methods require extensive paired training data or are constrained to specific domains. We introduce a semi-supervised approach for model alignment via conditional flow matching. The conditional flow between latent spaces of different modalities (e.g., text-to-image, or biological-to-artificial neuronal activity) can be learned in two settings: (1) solving a (balanced or unbalanced) optimal transport problem with an inter-space bridge cost, and (2) performing memory-efficient alignment using labelled exemplars. Despite being constrained by the original models' capacity, our method, under both settings, matches the downstream task performance of end-to-end trained models on object recognition and image generation tasks across MNIST, ImageNet, and the Majaj et al. (2015) neural datasets, particularly when labelled training data is scarce (<20%). Our method thus provides a data-efficient solution for inter-modal model alignment with minimal supervision.
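The abstract's first setting, optimal transport with an inter-space bridge cost, can be illustrated with a minimal sketch. The details below are assumptions for illustration, not the paper's implementation: we make two latent spaces of different dimension comparable by representing each point through its distances to a handful of labelled anchor pairs (a relative-representation-style bridge cost), then solve entropic balanced OT with a plain NumPy Sinkhorn loop.

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=200):
    """Entropic OT with uniform marginals: returns a coupling P
    that approximately minimizes <P, C> - reg * H(P)."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / reg)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)        # scale rows toward marginal a
        v = b / (K.T @ u)      # scale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Toy latents from two "modalities" with different dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))    # e.g. text-model latents
y = rng.normal(size=(60, 16))   # e.g. vision-model latents

# Hypothetical bridge cost: pretend the first 5 points of each set are
# labelled exemplar pairs, and describe every point by its distances to
# those anchors, so points from incomparable spaces share coordinates.
ax, ay = x[:5], y[:5]
rx = np.linalg.norm(x[:, None, :] - ax[None], axis=-1)        # (50, 5)
ry = np.linalg.norm(y[:, None, :] - ay[None], axis=-1)        # (60, 5)
C = np.linalg.norm(rx[:, None, :] - ry[None, :, :], axis=-1)  # (50, 60)
C = C / C.max()               # normalize to keep exp(-C/reg) well-scaled

P = sinkhorn(C)               # soft matching between the two latent sets
print(P.shape, P.sum())
```

In the paper's pipeline such a coupling would supply the (source, target) pairs on which a conditional flow is trained; the unbalanced variant mentioned in the abstract would relax the exact marginal constraints that Sinkhorn enforces here.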

@article{gholamzadeh2025_2505.12322,
  title={Model alignment using inter-modal bridges},
  author={Ali Gholamzadeh and Noor Sajid},
  journal={arXiv preprint arXiv:2505.12322},
  year={2025}
}