
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

26 June 2025
Xinping Zhao
Xinshuo Hu
Zifei Shan
Shouzheng Huang
Yao Zhou
Zetian Sun
Zhenyu Liu
Dongfang Li
Xinyuan Wei
Youcheng Pan
Yang Xiang
Meishan Zhang
Haofen Wang
Jun-chen Yu
Baotian Hu
Min Zhang
arXiv: 2506.20923 (abs) · PDF · HTML · Hugging Face · GitHub
Main: 8 pages · 5 figures · 20 tables · Bibliography: 14 pages · Appendix: 10 pages
Abstract

Recent advances in text embedding models based on Large Language Models (LLMs) have focused primarily on data scaling or synthesis, with limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models that systematically incentivize advanced embedding capability in LLMs through superior training techniques and high-quality data. For the model architecture, we build the models at a compact 0.5B parameter scale, using simple mean-pooling to produce fixed-length embeddings and removing the causal attention mask to enable fully bidirectional representation learning. For training, we propose a progressive multi-stage pipeline: pre-training on large-scale weakly supervised datasets, fine-tuning on high-quality supervised datasets, and contrastive distillation with fine-grained soft signals, combined with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and over 100 categories for fine-tuning and contrastive distillation, improving both performance and generalization; we leverage task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters.
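To make the architecture description concrete, here is a minimal sketch of the embedding forward pass: a small backbone whose token states are mean-pooled into a fixed-length, unit-normalized vector. The backbone name, instruction prefix, and pooling details below are illustrative assumptions rather than the authors' released implementation; disabling the causal attention mask is model-specific and omitted here.

```python
# Minimal sketch of the embedding forward pass outlined in the abstract:
# masked mean-pooling over token states to produce one fixed-length vector.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # hypothetical 0.5B backbone, stand-in only

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(texts, instruction=""):
    """Encode texts into fixed-length embeddings via masked mean-pooling."""
    batch = tokenizer(
        [instruction + t for t in texts],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # average over real tokens
    return F.normalize(pooled, dim=-1)                        # unit-norm embeddings
```

A task-specific instruction string can be prepended to queries, mirroring the instructed training data described in the abstract.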

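The multi-stage training recipe combines an in-batch contrastive objective, focal-style reweighting of difficult samples, and contrastive distillation against a teacher's soft similarity scores. The sketch below illustrates these pieces under stated assumptions; the paper's exact temperatures, loss weighting, and online hard-negative mixing may differ.

```python
# Rough sketch of the training objective: in-batch InfoNCE with focal-style
# reweighting, plus a KL term distilling a teacher's soft similarity distribution.
# Temperatures and the focusing parameter gamma are illustrative assumptions.
import torch
import torch.nn.functional as F


def focal_infonce(q, p, temperature=0.05, gamma=2.0):
    """In-batch contrastive loss that up-weights hard (low-probability) positives."""
    sim = q @ p.T / temperature                         # (B, B) similarity logits
    targets = torch.arange(q.size(0), device=q.device)
    log_probs = F.log_softmax(sim, dim=-1)
    pos_logprob = log_probs[targets, targets]           # log-prob of each positive
    focal_weight = (1.0 - pos_logprob.exp()) ** gamma   # focal-style reweighting
    return -(focal_weight * pos_logprob).mean()


def distill_loss(student_q, student_p, teacher_q, teacher_p, tau=0.05):
    """Contrastive distillation: match the student's similarity distribution
    to the teacher's fine-grained soft signals."""
    student = F.log_softmax(student_q @ student_p.T / tau, dim=-1)
    with torch.no_grad():
        teacher = F.softmax(teacher_q @ teacher_p.T / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")


# Example combination; the relative weight of the two terms is a free choice here:
# loss = focal_infonce(q, p) + 1.0 * distill_loss(q, p, teacher_q, teacher_p)
```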