A C++ library for text (and, eventually, image) embeddings, focused on efficient inference of BERT-like (and, eventually, CLIP-like) models.
Many existing GGML-based text embedding libraries have limited support for Chinese text processing due to their custom tokenizer implementations. This project addresses this limitation by leveraging Hugging Face's Rust tokenizer implementation, wrapped with a C++ API to ensure consistency with the Python transformers library while providing native performance.
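As a quick illustration of that consistency claim, the Rust-backed tokenizers package can be compared directly against transformers. The sketch below is illustrative only: the model id and sample sentence are arbitrary, and both packages need to be installed.

# Compare the Rust tokenizer (tokenizers) against the reference transformers tokenizer.
# Illustrative only: model id and sample text are arbitrary examples.
from tokenizers import Tokenizer
from transformers import AutoTokenizer

text = "今天天气真好"

rust_ids = Tokenizer.from_pretrained("BAAI/bge-m3").encode(text).ids
ref_ids = AutoTokenizer.from_pretrained("BAAI/bge-m3").encode(text)

print(rust_ids)
print(ref_ids)
assert rust_ids == ref_ids  # both code paths should yield identical token IDs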
While currently focused on BERT-like text embedding models, the project aims to support image embedding models in the future (Work in Progress).
Note: This is an experimental and educational project. It is not recommended for production use at this time.
The following models have been tested and verified:
- BAAI/bge-m3
- BAAI/bge-base-zh-v1.5
- shibing624/text2vec-base-multilingual
 
The C++ implementation closely matches the Python implementation, with differences on the order of 10^-9. For detailed comparison results, please refer to alignment.ipynb.
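A check of this kind boils down to comparing the two output vectors element-wise. Below is a minimal sketch with placeholder vectors; in practice one vector comes from embeddings.cpp and the other from the Python transformers pipeline.

# Minimal numerical comparison between two embeddings of the same input.
# The vectors below are random placeholders standing in for real outputs.
import numpy as np

rng = np.random.default_rng(0)
emb_python = rng.standard_normal(1024).astype(np.float32)                     # reference embedding
emb_cpp = emb_python + 1e-9 * rng.standard_normal(1024).astype(np.float32)    # C++ embedding

max_abs_diff = float(np.max(np.abs(emb_cpp - emb_python)))
cos_sim = float(np.dot(emb_cpp, emb_python) /
                (np.linalg.norm(emb_cpp) * np.linalg.norm(emb_python)))
print(f"max abs diff: {max_abs_diff:.2e}, cosine similarity: {cos_sim:.8f}")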
First, install the required dependencies:
pip install -r scripts/requirements.txt

Then convert the models to GGUF format:
# Convert BGE-M3 model
python scripts/convert.py BAAI/bge-m3 ./models/bge-m3.fp16.gguf f16
# Convert BGE-Base Chinese v1.5 model
python scripts/convert.py BAAI/bge-base-zh-v1.5 ./models/bge-base-zh-v1.5.fp16.gguf f16
python scripts/convert.py Snowflake/snowflake-arctic-embed-m-v2.0 ./models/snowflake-arctic-embed-m-v2.0.fp16.gguf f16
# Convert Text2Vec multilingual model
python scripts/convert.py shibing624/text2vec-base-multilingual ./models/text2vec-base-multilingual.fp16.gguf f16
python scripts/convert.py sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 ./models/paraphrase-multilingual-MiniLM-L12-v2.fp16.gguf f16
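If you want to sanity-check a converted file, the gguf Python package (install with pip install gguf if it is not already present) can read the metadata and tensor layout. This step is optional and independent of embeddings.cpp:

# Inspect a converted GGUF file (assumes the `gguf` Python package is installed)
from gguf import GGUFReader

reader = GGUFReader("./models/bge-m3.fp16.gguf")

# Metadata keys written by the converter
for key in reader.fields:
    print(key)

# Tensor names, shapes and quantization types
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)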
After converting models to GGUF format, you can quantize them to reduce memory usage and improve inference speed:
# Build the quantization tool
cmake --build build --target quantize
# Quantize a model (example with different quantization types)
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q4_k.gguf q4_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q6_k.gguf q6_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q8_0.gguf q8_0
# On Windows
.\build\Release\quantize.exe .\models\bge-m3.fp16.gguf .\models\bge-m3.q4_k.gguf q4_k

Supported quantization types (rough size estimates are sketched below):
- q4_k: 4-bit k-quant quantization (good balance of size and quality)
- q6_k: 6-bit k-quant quantization (higher quality, larger size)
- q8_0: 8-bit quantization (minimal quality loss, moderate size reduction)
- Other GGML quantization types as supported by the library
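To get a rough feel for the trade-off, the sketch below estimates file sizes from approximate bits-per-weight figures for these GGML block formats. The figures and the parameter count are rough assumptions for illustration, not measurements of the converted models:

# Back-of-the-envelope size estimate per quantization type.
# bits-per-weight values are approximations of GGML's block layouts; real files
# also contain metadata and some tensors kept in higher precision.
bits_per_weight = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.5625, "q4_k": 4.5}
n_params = 568_000_000  # placeholder parameter count; replace with your model's

for qtype, bpw in bits_per_weight.items():
    size_mib = n_params * bpw / 8 / 1024 / 1024
    print(f"{qtype}: ~{size_mib:.0f} MiB")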
 
quantize <input_model.gguf> <output_model.gguf> <qtype>
The quantization tool will:
- Load the input GGUF model
- Quantize eligible tensors (typically weight matrices)
- Preserve metadata and non-quantizable tensors
- Output size comparison and compression statistics (see the sketch below for reproducing this from file sizes)
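The size comparison is easy to reproduce, or to script across several quantization types, with a small driver around the CLI. The paths below follow the Linux/macOS examples above and are assumptions; adjust them for your setup:

# Quantize one fp16 model into several types and report the compression ratio.
# Paths follow the examples above; on Windows use .\build\Release\quantize.exe.
import os
import subprocess

src = "./models/bge-m3.fp16.gguf"
src_size = os.path.getsize(src)

for qtype in ["q4_k", "q6_k", "q8_0"]:
    dst = f"./models/bge-m3.{qtype}.gguf"
    subprocess.run(["./build/quantize", src, dst, qtype], check=True)
    dst_size = os.path.getsize(dst)
    print(f"{qtype}: {dst_size / 2**20:.1f} MiB ({dst_size / src_size:.1%} of fp16)")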
 
To use the Python bindings, first install embeddings.cpp as a Python package:
# Use CMAKE_ARGS to pass extra CMake settings (the example below uses PowerShell syntax)
$env:CMAKE_ARGS="-DGGML_VULKAN=ON"
# Install the package
pip install .
# Generate Python stub files
cd build && make stub
# on Windows
pip install pybind11-stubgen
# then
pybind11-stubgen embeddings_cpp -o .
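Once the package is installed, a quick way to confirm that the extension module imports, and to see which names it exposes (the same names the generated stubs describe), is to inspect it from Python:

# Import smoke test; uses only the standard library
import embeddings_cpp

print(embeddings_cpp.__file__)  # where the extension module was installed
print([name for name in dir(embeddings_cpp) if not name.startswith("_")])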
Run the tokenizer test:
python tests/test_tokenizer.py

Configure and build with Metal support:
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
      -DGGML_METAL=ON \
      -DGGML_METAL_EMBED_LIBRARY=ON \
      -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..

If you run into an OpenMP-related build error on macOS, try:
brew install libomp
export OpenMP_ROOT=$(brew --prefix)/opt/libomp
Configure and build with Vulkan support:
cmake -DGGML_VULKAN=ON -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..
# If you encounter any issues, ensure that your graphics driver and Vulkan SDK versions are compatible.
# You can also add -DGGML_VULKAN_DEBUG=ON -DGGML_VULKAN_VALIDATE=ON for debugging

GGML debug support is now enabled by default in the vendored version. This provides better debugging capabilities for CPU backend operations without requiring additional patches.
For more information about GGML debugging features, see: ggml-org/ggml#655