Measuring biodiversity is crucial for understanding ecosystem health. While prior work has developed machine learning models for taxonomic classification from photographic images and from DNA separately, we introduce a multimodal approach that combines both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This enables accurate classification of both known and unknown insect species without task-specific fine-tuning, and is the first use of contrastive learning to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, demonstrating its effectiveness for biodiversity studies.
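The CLIP-style alignment described above can be sketched with a symmetric InfoNCE objective summed over modality pairs. This is a minimal illustration, not the paper's implementation: the encoders are stood in by random vectors, and the pairwise loss structure, batch size, and temperature value are assumptions for demonstration only.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair;
    all other rows in the batch serve as negatives.
    """
    logits = normalize(a) @ normalize(b).T / temperature
    labels = np.arange(len(a))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average losses in both retrieval directions (a->b and b->a)
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))  # stand-in for image encoder outputs
dna = rng.normal(size=(4, 16))  # stand-in for DNA barcode encoder outputs
txt = rng.normal(size=(4, 16))  # stand-in for taxonomic-label text encoder outputs

# Hypothetical total objective: sum contrastive terms over all modality pairs
loss = info_nce(img, dna) + info_nce(img, txt) + info_nce(dna, txt)
print(float(loss))
```

Minimizing such a loss pulls the image, DNA, and text embeddings of the same specimen together, which is what makes zero-shot classification by nearest-neighbor lookup in the shared space possible.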
@article{gong2025_2405.17537,
  title={CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale},
  author={ZeMing Gong and Austin T. Wang and Xiaoliang Huo and Joakim Bruslund Haurum and Scott C. Lowe and Graham W. Taylor and Angel X. Chang},
  journal={arXiv preprint arXiv:2405.17537},
  year={2025}
}