KaLM-Embedding

✨ Overview

Code for training and evaluation of our KaLM-Embedding models.

Pretraining data: HIT-TMG/KaLM-embedding-pretrain-data

Technical Reports: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model and KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.

KaLM-Embedding-V1.5: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5

KaLM-Embedding-V2: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2

⚡ Features

  • Training
    • Ranking Consistency Filtering
    • Semi-homogeneous Task Batching
    • Matryoshka Representation Learning
  • Evaluation
    • Multi-GPU Asynchronous Computation
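
Of these training techniques, Matryoshka Representation Learning trains embeddings whose truncated prefixes remain usable on their own. As a rough illustration of the idea (not the repository's implementation; the prefix dimensions, temperature, and toy data below are assumptions), an in-batch contrastive loss can be averaged over nested prefix lengths:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matryoshka_loss(queries, passages, dims=(2, 4, 8), temperature=0.05):
    """Average an in-batch contrastive (InfoNCE-style) loss over nested
    prefix dimensions, so truncated embeddings stay useful.
    Illustrative sketch only; dims and temperature are assumptions."""
    total = 0.0
    for d in dims:
        q_d = [normalize(q[:d]) for q in queries]
        p_d = [normalize(p[:d]) for p in passages]
        for i, q in enumerate(q_d):
            # Similarity of query i to every passage in the batch.
            logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in p_d]
            # Cross-entropy with label = i (query i matches passage i).
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += -(logits[i] - log_z)
    return total / (len(dims) * len(queries))

random.seed(0)
queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
passages = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
print(round(matryoshka_loss(queries, passages), 4))
```

Because every prefix length shares the same weights, a single trained model can serve several embedding sizes at inference time by simple truncation.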

💻 Usage

🌈 Environment:

conda env create -f environment.yaml
conda activate kalm

⛏️ Hard-negative Mining (with Filtering):

bash ./scripts/hn_mine.sh

You can customize the filter_topk parameter to set the threshold for ranking consistency filtering.
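
As a rough sketch of the idea behind filter_topk (the function name and scoring convention below are hypothetical, not the script's actual code): a mined example is kept only when its positive passage still ranks within the top filter_topk candidates under the mining model's scores.

```python
def ranking_consistency_filter(pos_score, neg_scores, filter_topk=20):
    """Keep a mined example only if its positive passage ranks within the
    top `filter_topk` among the mined candidates (hypothetical sketch of
    the idea behind the filter_topk parameter)."""
    # Rank of the positive = 1 + number of candidates scored above it.
    rank = 1 + sum(1 for s in neg_scores if s > pos_score)
    return rank <= filter_topk

# A positive scored above most candidates passes the filter.
print(ranking_consistency_filter(0.8, [0.9, 0.7, 0.6], filter_topk=2))  # True
# A positive buried below many candidates is filtered out.
print(ranking_consistency_filter(0.5, [0.9, 0.8, 0.7], filter_topk=2))  # False
```

A smaller filter_topk makes the filter stricter, discarding examples whose positives the mining model itself ranks poorly.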

🔥 Training:

bash ./scripts/train.sh
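
Training uses semi-homogeneous task batching, which draws most of each batch from a single task (yielding harder in-batch negatives) while mixing in examples from other tasks. A toy sketch of such a sampler (the ratio, batch size, and dataset names are illustrative assumptions, not the training script's configuration):

```python
import random

def semi_homogeneous_batches(datasets, batch_size=4, homogeneous_ratio=0.75, seed=0):
    """Yield (task, batch) pairs where ~homogeneous_ratio of each batch
    comes from one task and the rest is sampled from the other tasks.
    Illustrative sketch only."""
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        pool = list(examples)
        rng.shuffle(pool)
        n_home = max(1, int(batch_size * homogeneous_ratio))
        while len(pool) >= n_home:
            # Majority of the batch from the current task...
            batch = [pool.pop() for _ in range(n_home)]
            # ...topped up with examples drawn from the other tasks.
            others = [x for other, xs in datasets.items() if other != name for x in xs]
            batch += rng.sample(others, batch_size - n_home)
            batches.append((name, batch))
    return batches

data = {"retrieval": list(range(6)), "sts": list(range(100, 106))}
for task, batch in semi_homogeneous_batches(data):
    print(task, batch)
```

Keeping batches mostly single-task makes in-batch negatives come from the same distribution as the positives, which tends to make the contrastive signal more informative.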

📊 Evaluation:

We provide code for evaluating on MTEB with multiple GPUs: each task in the task set is dispatched to a single GPU through a queue, which improves evaluation efficiency.

bash ./scripts/eval_mteb.sh
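
The queue-based allocation described above can be sketched with Python's standard library (the task names are placeholders, and threads stand in for per-GPU worker processes; the actual script manages real GPU processes):

```python
import queue
import threading

def run_task(task_name, gpu_id):
    """Placeholder for evaluating one MTEB task on a given GPU
    (the real script would pin the worker to that device)."""
    return f"{task_name} done on GPU {gpu_id}"

def worker(gpu_id, tasks, results):
    # Each "GPU" keeps pulling the next pending task until the queue is empty,
    # so fast tasks free their GPU for the remaining work.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        results.append(run_task(task, gpu_id))
        tasks.task_done()

tasks = queue.Queue()
for name in ["TaskA", "TaskB", "TaskC", "TaskD"]:  # placeholder task names
    tasks.put(name)

results = []
threads = [threading.Thread(target=worker, args=(gpu, tasks, results)) for gpu in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # prints 4: all tasks completed across 2 workers
```

Because workers pull tasks on demand rather than receiving a fixed static split, no GPU sits idle while another still has a long backlog.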

🔍 Results

Below we present a selection of results on MTEB. For a more comprehensive analysis, please refer to our technical reports.

Overall results on MTEB (cmn, v1) and MTEB (eng, v1).

Detailed model performance on MTEB (cmn, v1).

Detailed model performance on MTEB (eng, v1).

📢 Acknowledgements

Our training code was forked from FlagOpen/FlagEmbedding. We have modified it to suit our needs, but the core functionality and structure derive from their excellent work. Please check out their repository for more details!

🔗 Citation

If you find this model useful, please consider giving it a star and citing our work.

@misc{zhao2025kalmembeddingv2,
  title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model}, 
  author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Qian Chen and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
  year={2025},
  eprint={2506.20923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.20923}
}

@misc{hu2025kalmembedding,
  title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model}, 
  author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
  year={2025},
  eprint={2501.01028},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.01028}
}

📜 License

This repository is licensed under the MIT License.