
Commit e100501

Docs: Update README to feature LlamaEmbedding and Reranking workflows

- Added usage guide for the new `LlamaEmbedding` class.
- Included code snippets for Reranking, Batching, and Normalization.
- Updated legacy examples to reflect current best practices.
1 parent 5c424ab commit e100501

File tree

2 files changed (+121, -18 lines)

README.md

Lines changed: 120 additions & 18 deletions
@@ -730,46 +730,148 @@ print(res["choices"][0]["message"]["content"])
```

---

## Embeddings & Reranking (GGUF)

`llama-cpp-python` provides `LlamaEmbedding`, a specialized, high-performance, memory-efficient class for generating text embeddings and calculating reranking scores.

**Key Features:**
* **Streaming Batch Processing:** Process massive datasets (e.g., hundreds of documents) without running out of memory (OOM).
* **Native Reranking:** Built-in support for Cross-Encoder models (outputting relevance scores instead of vectors).
* **Optimized Performance:** Utilizes a unified KV cache for parallel encoding of multiple documents.

> **TODO (JamePeng):** Needs more extensive testing with various embedding and reranking models.

#### 1. Text Embeddings (Vector Search)

To generate embeddings, use the `LlamaEmbedding` class. It automatically configures the model for vector generation.

```python
from llama_cpp.llama_embedding import LlamaEmbedding

# Initialize the model (automatically sets embedding=True)
llm = LlamaEmbedding(model_path="path/to/bge-m3.gguf")

# 1. Simple usage (OpenAI-compatible format)
response = llm.create_embedding("Hello, world!")
print(response['data'][0]['embedding'])

# 2. Batch processing (High Performance)
# You can pass a large list of strings; the streaming batcher handles memory automatically.
documents = ["Hello, world!", "Goodbye, world!", "Llama is cute."] * 100
embeddings = llm.embed(documents)  # Returns a list of lists (vectors)

print(f"Generated {len(embeddings)} vectors.")
```
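
As a quick illustration of how these vectors can power the "vector search" use case, here is a minimal brute-force cosine-similarity sketch. It assumes `numpy` is installed and reuses the `llm` instance from the example above; since `embed` L2-normalizes by default, a plain dot product equals cosine similarity.

```python
import numpy as np

# Reuses `llm` from the example above.
corpus = ["Hello, world!", "Goodbye, world!", "Llama is cute."]
corpus_vecs = np.asarray(llm.embed(corpus))                     # shape: (num_docs, dim)
query_vec = np.asarray(llm.embed(["A friendly greeting"])[0])   # shape: (dim,)

# Dot product of L2-normalized vectors == cosine similarity.
scores = corpus_vecs @ query_vec
best = int(np.argmax(scores))
print(f"Best match: {corpus[best]!r} (score={scores[best]:.3f})")
```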

**Advanced Output Formats:**
You can request raw arrays or cosine similarity matrices directly:

```python
# Returns raw List[float] instead of a dictionary wrapper
vector = llm.create_embedding("Text", output_format="array")

# Returns a similarity matrix (A @ A.T) in the response
# Note: Requires numpy installed
response = llm.create_embedding(
    ["apple", "fruit", "car"],
    output_format="json+"
)
print(response["cosineSimilarity"])
```

#### 2. Reranking (Cross-Encoder Scoring)

Reranking models (like `bge-reranker`) take a **Query** and a list of **Documents** as input and output a relevance score (scalar) for each document.

> **Important:** You must explicitly set `pooling_type` to `LLAMA_POOLING_TYPE_RANK` (4) when initializing the model.

```python
import llama_cpp
from llama_cpp.llama_embedding import LlamaEmbedding

# Initialize a Reranking model
ranker = LlamaEmbedding(
    model_path="path/to/bge-reranker-v2-m3.gguf",
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_RANK  # Crucial for Rerankers!
)

query = "What causes rain?"
docs = [
    "Clouds are made of water droplets...",  # Relevant
    "To bake a cake you need flour...",  # Irrelevant
    "Rain is liquid water in the form of droplets..."  # Highly Relevant
]

# Calculate relevance scores
# Logic: Constructs inputs like "[BOS] query [SEP] doc [EOS]" automatically
scores = ranker.rank(query, docs)

# Result: List of floats (higher means more relevant)
print(scores)
# e.g., [-0.15, -8.23, 5.67] -> The 3rd doc is the best match
```
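
To turn the raw scores into an ordering, pair each document with its score and sort in descending order. A minimal follow-up using the `query`, `docs`, and `scores` from the example above:

```python
# Sort documents by descending relevance score.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:+.2f}  {doc}")
```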

#### 3. Normalization

The `embed` method supports various mathematical normalization strategies via the `normalize` parameter.

| Normalization mode  | Integer | Description         | Formula |
|---------------------|---------|---------------------|---------|
| NORM_MODE_NONE      | $-1$    | none                | $x_i$ |
| NORM_MODE_MAX_INT16 | $0$     | max absolute int16  | $\Large{{32760 * x_i} \over \max \lvert x_i \rvert}$ |
| NORM_MODE_TAXICAB   | $1$     | taxicab             | $\Large{x_i \over \sum \lvert x_i \rvert}$ |
| NORM_MODE_EUCLIDEAN | $2$     | euclidean (default) | $\Large{x_i \over \sqrt{\sum x_i^2}}$ |
| NORM_MODE_PNORM     | $>2$    | p-norm              | $\Large{x_i \over \sqrt[p]{\sum \lvert x_i \rvert^p}}$ |

This is useful for optimizing storage or preparing vectors for cosine similarity search (with L2-normalized vectors, cosine similarity reduces to a plain dot product).

```python
from llama_cpp.llama_embedding import NORM_MODE_NONE, NORM_MODE_MAX_INT16, NORM_MODE_TAXICAB, NORM_MODE_EUCLIDEAN

# Taxicab (L1)
vec_l1 = llm.embed("text", normalize=NORM_MODE_TAXICAB)

# Default is Euclidean (L2) - Standard for vector databases
vec_l2 = llm.embed("text", normalize=NORM_MODE_EUCLIDEAN)

# Max Absolute Int16 - Useful for quantization/compression
vec_int16 = llm.embed("text", normalize=NORM_MODE_MAX_INT16)

# Raw Output (No Normalization) - Get the raw floating point values from the model
embeddings_raw = llm.embed(["search query", "document text"], normalize=NORM_MODE_NONE)
```
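
As a small sanity check of the normalization modes, the sketch below verifies the vector norms with `numpy` (assumed installed), reusing `vec_l1` and `vec_l2` from the block above:

```python
import numpy as np

# L2-normalized vectors should have unit Euclidean length ...
print(np.linalg.norm(np.asarray(vec_l2)))   # ~1.0
# ... and L1-normalized vectors should have unit taxicab length.
print(np.abs(np.asarray(vec_l1)).sum())     # ~1.0
```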

#### Legacy Usage (Deprecated)

The standard `Llama` class still supports basic embedding generation, but it lacks the memory optimizations and reranking capabilities of `LlamaEmbedding`.

```python
# Old method - Not recommended for large batches or reranking
llm = llama_cpp.Llama(model_path="...", embedding=True)
emb = llm.create_embedding("text")
```

---

### Speculative Decoding

`llama-cpp-python` supports speculative decoding, which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.

Just pass this as a draft model to the `Llama` class during initialization.

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict; 10 is the default and
    # generally works well on GPU, while 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
```
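
Once initialized, generation works exactly as it does without a draft model; a minimal usage sketch (the prompt and sampling settings here are only illustrative):

```python
# The prompt-lookup draft model accelerates decoding transparently.
output = llama(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])
```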
### Adjusting the Context Window

llama_cpp/llama_embedding.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
NORM_MODE_MAX_INT16 = 0
NORM_MODE_TAXICAB = 1
NORM_MODE_EUCLIDEAN = 2
NORM_MODE_PNORM = 6

# TODO(JamePeng): Needs more extensive testing with various embedding and reranking models.
class LlamaEmbedding(Llama):
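
For reference, a hedged usage sketch of the new constant: per the README table above, `normalize` values greater than 2 select a p-norm, so `NORM_MODE_PNORM = 6` should correspond to a 6-norm (this interpretation is an assumption, and the model path below is a placeholder).

```python
from llama_cpp.llama_embedding import LlamaEmbedding, NORM_MODE_PNORM

llm = LlamaEmbedding(model_path="path/to/bge-m3.gguf")  # placeholder path

# Assumption: normalize values > 2 are treated as the p in a p-norm (here p = 6).
vec_p6 = llm.embed("text", normalize=NORM_MODE_PNORM)
```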
