Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement

Cross-view geo-localization (CVGL) aims to match images of the same geographic location captured from different perspectives, such as drone and satellite views. Despite recent advances, CVGL remains highly challenging due to the significant appearance changes and spatial distortions caused by viewpoint variations. Existing methods typically assume that cross-view images can be directly aligned in a shared feature space by maximizing feature similarity through contrastive learning. However, this assumption overlooks the inherent conflicts induced by viewpoint discrepancies, so the extracted features contain inconsistent information that hinders precise localization. In this study, we take a manifold learning perspective and model the feature space of cross-view images as a composite manifold jointly governed by content and viewpoint information. Building on this insight, we propose CVD, a new CVGL framework that explicitly disentangles content and viewpoint factors. To promote effective disentanglement, we introduce two constraints: (i) an intra-view independence constraint, which encourages statistical independence between the two factors by minimizing their mutual information, and (ii) an inter-view reconstruction constraint, which reconstructs each view by cross-combining the content and viewpoint factors of paired images, ensuring that factor-specific semantics are preserved. As a plug-and-play module, CVD integrates seamlessly into existing geo-localization pipelines. Extensive experiments on four benchmarks (University-1652, SUES-200, CVUSA, and CVACT) demonstrate that CVD consistently improves both localization accuracy and generalization across multiple baselines.
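The two constraints can be sketched in a few lines of NumPy. This is a minimal illustration under assumed simplifications, not the paper's implementation: the factor split is a slice of the feature vector, the mutual-information penalty is replaced by a tractable cross-covariance proxy, and the decoder is an identity map. All function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_factors(feat, d_content):
    # Toy "encoder": first d_content dims as content, remainder as viewpoint.
    return feat[:, :d_content], feat[:, d_content:]

def independence_penalty(c, v):
    # Intra-view independence constraint (proxy): penalize the squared
    # cross-covariance between content and viewpoint factors. The paper
    # minimizes mutual information; cross-covariance is a simple stand-in.
    c = c - c.mean(axis=0)
    v = v - v.mean(axis=0)
    cov = c.T @ v / len(c)
    return float((cov ** 2).sum())

def cross_reconstruction_loss(feat_a, feat_b, d_content, decode):
    # Inter-view reconstruction constraint: rebuild each view from the
    # paired view's content factor combined with its own viewpoint factor.
    c_a, v_a = split_factors(feat_a, d_content)
    c_b, v_b = split_factors(feat_b, d_content)
    rec_a = decode(np.concatenate([c_b, v_a], axis=1))
    rec_b = decode(np.concatenate([c_a, v_b], axis=1))
    return float(((rec_a - feat_a) ** 2).mean() + ((rec_b - feat_b) ** 2).mean())

# Identity decoder purely for demonstration.
decode = lambda z: z

feat_a = rng.normal(size=(8, 16))  # e.g. drone-view features
feat_b = rng.normal(size=(8, 16))  # paired satellite-view features
l_ind = independence_penalty(*split_factors(feat_a, 8))
l_rec = cross_reconstruction_loss(feat_a, feat_b, 8, decode)
print(l_ind, l_rec)
```

In a real pipeline both terms would be added, with weighting coefficients, to the contrastive matching loss of the host geo-localization model, which is what makes CVD usable as a plug-and-play module.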
@article{li2025_2505.11822,
  title   = {Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement},
  author  = {Ke Li and Di Wang and Xiaowei Wang and Zhihong Wu and Yiming Zhang and Yifeng Wang and Quan Wang},
  journal = {arXiv preprint arXiv:2505.11822},
  year    = {2025}
}