Code for training and evaluation of our KaLM-Embedding models.
Pretraining data: HIT-TMG/KaLM-embedding-pretrain-data
Technical Reports: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model and KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.
KaLM-Embedding-V1.5: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5
KaLM-Embedding-V2: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2
- Training
  - Ranking Consistency Filtering
  - Semi-homogeneous Task Batching
  - Matryoshka Representation Learning
- Evaluation
  - Multi-GPU Asynchronous Computation
conda env create -f environment.yaml
conda activate kalm
bash ./scripts/hn_mine.sh
You can customize the filter_topk parameter to set the threshold for ranking consistency filtering.
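As a rough illustration of what a top-k threshold like filter_topk could control, the sketch below assumes that ranking consistency filtering keeps a training pair only when its labeled positive ranks within the top filter_topk candidates for its query. This interpretation, the function names `positive_rank` and `filter_by_rank`, and the toy scores are assumptions made for illustration; they are not taken from scripts/hn_mine.sh.

```python
from typing import List, Sequence


def positive_rank(scores: Sequence[float], positive_idx: int) -> int:
    """0-based rank of the positive: the number of candidates scoring strictly higher."""
    pos_score = scores[positive_idx]
    return sum(1 for s in scores if s > pos_score)


def filter_by_rank(
    all_scores: Sequence[Sequence[float]],
    positive_indices: Sequence[int],
    filter_topk: int,
) -> List[bool]:
    """Return one flag per query: True if its positive ranks within the top `filter_topk`."""
    return [
        positive_rank(scores, pos) < filter_topk
        for scores, pos in zip(all_scores, positive_indices)
    ]


# Toy similarity scores (hypothetical): 3 queries, 4 candidates each,
# with the labeled positive stored in column 0.
all_scores = [
    [0.9, 0.2, 0.1, 0.3],  # positive ranks 1st
    [0.4, 0.8, 0.7, 0.6],  # positive ranks 4th
    [0.5, 0.6, 0.3, 0.2],  # positive ranks 2nd
]
keep = filter_by_rank(all_scores, positive_indices=[0, 0, 0], filter_topk=2)
```

Under this assumption, a smaller filter_topk filters more aggressively, since the positive must rank closer to the top for the pair to be kept.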
bash ./scripts/train.sh
We provide code for evaluating on MTEB with multiple GPUs. It assigns each task in the task set to a single GPU through a queue, which improves evaluation efficiency.
bash ./scripts/eval_mteb.sh
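The queue-based allocation described above can be sketched roughly as follows. This is a minimal illustration and not the code behind scripts/eval_mteb.sh: the function `run_tasks_on_gpus`, the placeholder `fake_run_task`, and the task names are hypothetical, and threads stand in for whatever per-GPU workers the real script launches so that the sketch runs without any GPU.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_tasks_on_gpus(
    tasks: List[str],
    gpu_ids: List[int],
    run_task: Callable[[str, int], None],
) -> Dict[str, int]:
    """Run each task on exactly one GPU; an idle GPU pulls the next task from a shared queue."""
    task_queue: queue.Queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    assignment: Dict[str, int] = {}
    lock = threading.Lock()

    def worker(gpu_id: int) -> None:
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            run_task(task, gpu_id)  # hypothetical: evaluate `task` on device `gpu_id`
            with lock:
                assignment[task] = gpu_id

    threads = [threading.Thread(target=worker, args=(g,)) for g in gpu_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignment


def fake_run_task(task: str, gpu_id: int) -> None:
    """Placeholder for a real evaluation call on one GPU."""
    pass


task_names = ["TaskA", "TaskB", "TaskC", "TaskD", "TaskE"]  # hypothetical task names
assignment = run_tasks_on_gpus(task_names, gpu_ids=[0, 1], run_task=fake_run_task)
```

Because each worker takes a new task as soon as it finishes the previous one, a GPU that draws short tasks simply processes more of them, and no GPU sits idle while tasks remain in the queue.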
Below, we present a portion of the results from the MTEB study. For a more comprehensive analysis, please refer to our technical report.
Our training code was forked from FlagOpen/FlagEmbedding. We have modified it to suit our needs, but the core functionality and structure are derived from their excellent work. Please check out their repository for more details!
If you find this model useful, please consider giving this repository a star and citing our work.
@misc{zhao2025kalmembeddingv2,
title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model},
author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Qian Chen and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
year={2025},
eprint={2506.20923},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.20923},
}
@misc{hu2025kalmembedding,
title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
year={2025},
eprint={2501.01028},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.01028},
}
This repository is released under the MIT license.