embedding.cpp is a text embedding tool that runs BERT-based models on top of ggml.
embedding.cpp is a fork of bert.cpp. Thanks to the bert.cpp contributors. The original README is also available.
Note: embedding.cpp is still WIP and not ready for production.
- Plain C/C++ implementation without dependencies
- Inherit support for various architectures from ggml (x86 with AVX2, ARM, etc.)
- Choose your model size from 32/16/4 bits per model weight
- all-MiniLM-L6-v2 with 4-bit quantization is only 14MB; inference RAM usage depends on the length of the input
- Sample C++ server over a TCP socket and a Python test client
- Benchmarks to validate correctness and speed of inference
- Builds the tokenizer with tokenizers-cpp (see the sketch after this list).
- Correctly handles Asian scripts (CJK, etc.).
- Handles cased/uncased input according to the original configuration in tokenizer.json.
- Upgraded to the GGUF model file format, which makes it easy to extend while staying compatible. With this, embedding.cpp can run additional models such as m3e and e5.
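Since tokenizers-cpp binds the same Rust tokenizers library that backs the Python `tokenizers` package, you can preview from Python how a model's tokenizer.json will behave, including lowercasing for uncased models and per-character CJK splitting. A minimal sketch (illustrative only, not part of this repo):

```python
# Illustrative: the Python `tokenizers` package shares its Rust core with
# tokenizers-cpp, so this previews how embedding.cpp tokenizes input.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("all-MiniLM-L6-v2/tokenizer.json")

# Uncased models lowercase via their normalizer; CJK characters are
# split per-character by the BERT pre-tokenizer.
enc = tok.encode("Hello World! 你好，世界")
print(enc.tokens)  # e.g. ['[CLS]', 'hello', 'world', '!', '你', '好', ...]
```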
- Only BERT-based models are supported for embedding; other architectures such as SGPT are not.
- Runs on CPU only.
- All outputs are mean pooled and normalized (see the sketch after this list).
- Batching support is WIP.
- The lack of real batching means this library is slower than it could be in use cases with multiple sentences.
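For reference, the mean pooling and L2 normalization applied to every output amount to the following NumPy sketch (a description of the math, not the actual code path in this repo):

```python
import numpy as np

def pool_and_normalize(token_embeddings: np.ndarray,
                       attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.

    token_embeddings: (seq_len, hidden_dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    count = np.maximum(mask.sum(), 1e-9)   # avoid division by zero
    mean = summed / count
    return mean / np.linalg.norm(mean)     # unit-length sentence vector
```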
git submodule update --init --recursive
By default, it builds both
- the native binaries, like the example server, with static libraries;
- and the dynamic library for usage from e.g. Python.
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
cd ..
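After the build, the dynamic library can be loaded from Python with ctypes. A minimal sketch; the library file name and the entry point shown are assumptions, so check the build output and the public header for the real names and signatures:

```python
import ctypes

# Assumed artifact name -- the actual file may differ by platform
# (.dylib on macOS, .dll on Windows) and by the target name in CMake.
lib = ctypes.CDLL("./build/libembedding.so")

# Hypothetical entry point: declare argtypes/restype to match the real
# header before calling anything.
# lib.bert_load_from_file.argtypes = [ctypes.c_char_p]
# lib.bert_load_from_file.restype = ctypes.c_void_p
# ctx = lib.bert_load_from_file(b"models/all-MiniLM-L6-v2/ggml-model-q4_0.gguf")
```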
Rust needs to be installed (it is required to build tokenizers-cpp); see rust or run
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Converting models works much like in llama.cpp. Use models/convert-to-gguf.py to convert Hugging Face models into either f32 or f16 GGUF models, then use ./build/bin/quantize to turn those into Q4_0 (4 bits per weight) models.
There is also models/run_conversions.sh, which creates all four versions (f32, f16, Q4_0, Q4_1) at once.
pip install -r requirements.txt
cd models
# Clone a model from Hugging Face
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# Run conversions to the 4 GGUF formats (f32, f16, Q4_0, Q4_1)
sh run_conversions.sh all-MiniLM-L6-v2
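To sanity-check a converted file, you can read its GGUF header: every GGUF file begins with the magic bytes `GGUF` followed by a little-endian uint32 version (the output file name below is an assumption about what run_conversions.sh produces):

```python
import struct

# Assumed output path -- adjust to whatever run_conversions.sh actually wrote.
path = "all-MiniLM-L6-v2/ggml-model-q4_0.gguf"

with open(path, "rb") as f:
    magic = f.read(4)                        # b"GGUF" for valid files
    (version,) = struct.unpack("<I", f.read(4))

assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
print(f"GGUF version {version}")
```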