KaLM-Embedding

✨ Overview

Code for training and evaluation of our KaLM-Embedding models.

Pretraining data: HIT-TMG/KaLM-embedding-pretrain-data

Technical Reports: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model and KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.

KaLM-Embedding-V1.5: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5

KaLM-Embedding-V2: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2

⚡ Features

  • Training
    • Ranking Consistency Filtering
    • Semi-homogeneous Task Batching
    • Matryoshka Representation Learning
  • Evaluation
    • Multi-GPU Asynchronous Computation
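
Of these training techniques, Matryoshka Representation Learning trains embeddings whose truncated prefixes remain usable on their own. As a rough illustration of the idea (not the repository's implementation; the prefix dimensions, temperature, and toy data below are assumptions), an in-batch contrastive loss can be averaged over nested prefix lengths:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matryoshka_loss(queries, passages, dims=(2, 4, 8), temperature=0.05):
    """Average an in-batch contrastive (InfoNCE-style) loss over nested
    prefix dimensions, so truncated embeddings stay useful.
    Illustrative sketch only; dims and temperature are assumptions."""
    total = 0.0
    for d in dims:
        q_d = [normalize(q[:d]) for q in queries]
        p_d = [normalize(p[:d]) for p in passages]
        for i, q in enumerate(q_d):
            # Similarity of query i to every passage in the batch.
            logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in p_d]
            # Cross-entropy with label = i (query i matches passage i).
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += -(logits[i] - log_z)
    return total / (len(dims) * len(queries))

random.seed(0)
queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
passages = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
print(round(matryoshka_loss(queries, passages), 4))
```

Because every prefix length shares the same weights, a single trained model can serve several embedding sizes at inference time by simple truncation.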

💻 Usage

🌈 Environment:

conda env create -f environment.yaml
conda activate kalm

⛏️ Hard-negative Mining (with Filtering):

bash ./scripts/hn_mine.sh

You can customize the filter_topk parameter to set the threshold for ranking consistency filtering.
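
As a rough sketch of the idea behind filter_topk (the function name and scoring convention below are hypothetical, not the script's actual code): a mined example is kept only when its positive passage still ranks within the top filter_topk candidates under the mining model's scores.

```python
def ranking_consistency_filter(pos_score, neg_scores, filter_topk=20):
    """Keep a mined example only if its positive passage ranks within the
    top `filter_topk` among the mined candidates (hypothetical sketch of
    the idea behind the filter_topk parameter)."""
    # Rank of the positive = 1 + number of candidates scored above it.
    rank = 1 + sum(1 for s in neg_scores if s > pos_score)
    return rank <= filter_topk

# A positive scored above most candidates passes the filter.
print(ranking_consistency_filter(0.8, [0.9, 0.7, 0.6], filter_topk=2))  # True
# A positive buried below many candidates is filtered out.
print(ranking_consistency_filter(0.5, [0.9, 0.8, 0.7], filter_topk=2))  # False
```

A smaller filter_topk makes the filter stricter, discarding examples whose positives the mining model itself ranks poorly.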

🔥 Training:

bash ./scripts/train.sh
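
Training uses semi-homogeneous task batching, which draws most of each batch from a single task (yielding harder in-batch negatives) while mixing in examples from other tasks. A toy sketch of such a sampler (the ratio, batch size, and dataset names are illustrative assumptions, not the training script's configuration):

```python
import random

def semi_homogeneous_batches(datasets, batch_size=4, homogeneous_ratio=0.75, seed=0):
    """Yield (task, batch) pairs where ~homogeneous_ratio of each batch
    comes from one task and the rest is sampled from the other tasks.
    Illustrative sketch only."""
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        pool = list(examples)
        rng.shuffle(pool)
        n_home = max(1, int(batch_size * homogeneous_ratio))
        while len(pool) >= n_home:
            # Majority of the batch from the current task...
            batch = [pool.pop() for _ in range(n_home)]
            # ...topped up with examples drawn from the other tasks.
            others = [x for other, xs in datasets.items() if other != name for x in xs]
            batch += rng.sample(others, batch_size - n_home)
            batches.append((name, batch))
    return batches

data = {"retrieval": list(range(6)), "sts": list(range(100, 106))}
for task, batch in semi_homogeneous_batches(data):
    print(task, batch)
```

Keeping batches mostly single-task makes in-batch negatives come from the same distribution as the positives, which tends to make the contrastive signal more informative.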

📊 Evaluation:

We provide code for evaluating on MTEB with multiple GPUs: each task in the task set is dispatched to a single GPU through a queue, which improves evaluation efficiency.

bash ./scripts/eval_mteb.sh
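
The queue-based allocation described above can be sketched with Python's standard library (the task names are placeholders, and threads stand in for per-GPU worker processes; the actual script manages real GPU processes):

```python
import queue
import threading

def run_task(task_name, gpu_id):
    """Placeholder for evaluating one MTEB task on a given GPU
    (the real script would pin the worker to that device)."""
    return f"{task_name} done on GPU {gpu_id}"

def worker(gpu_id, tasks, results):
    # Each "GPU" keeps pulling the next pending task until the queue is empty,
    # so fast tasks free their GPU for the remaining work.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        results.append(run_task(task, gpu_id))
        tasks.task_done()

tasks = queue.Queue()
for name in ["TaskA", "TaskB", "TaskC", "TaskD"]:  # placeholder task names
    tasks.put(name)

results = []
threads = [threading.Thread(target=worker, args=(gpu, tasks, results)) for gpu in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # prints 4: all tasks completed across 2 workers
```

Because workers pull tasks on demand rather than receiving a fixed static split, no GPU sits idle while another still has a long backlog.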

🔍 Results

Below we present a selection of results on MTEB. For a more comprehensive analysis, please refer to our technical reports.

Overall results on MTEB (cmn, v1) and MTEB (eng, v1).

Detailed model performance on MTEB (cmn, v1).

Detailed model performance on MTEB (eng, v1).

📢 Acknowledgements

Our training code was forked from FlagOpen/FlagEmbedding. We have modified it to suit our needs, but the core functionality and structure derive from their excellent work. Please check out their repository for more details!

🔗 Citation

If you find this model useful, please consider giving it a star and citing our work.

@misc{zhao2025kalmembeddingv2,
  title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model}, 
  author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Qian Chen and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
  year={2025},
  eprint={2506.20923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.20923}
}

@misc{hu2025kalmembedding,
  title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model}, 
  author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
  year={2025},
  eprint={2501.01028},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.01028}
}

📜 License

This repository is licensed under the MIT License.