Docs: Update README to feature LlamaEmbedding and Reranking workflows
- Added usage guide for the new `LlamaEmbedding` class.
- Included code snippets for Reranking, Batching, and Normalization.
- Updated legacy examples to reflect current best practices.

`llama-cpp-python` provides a high-performance, memory-efficient specialized class, `LlamaEmbedding`, for generating text embeddings and calculating reranking scores.

**Key Features:**
* **Streaming Batch Processing:** Process massive datasets (e.g., hundreds of documents) without running out of memory (OOM).
* **Native Reranking:** Built-in support for Cross-Encoder models (outputting relevance scores instead of vectors).
* **Optimized Performance:** Utilizes a unified KV cache for parallel encoding of multiple documents.
### TODO(JamePeng): Needs more extensive testing with various embedding and rerank models. :)
#### 1. Text Embeddings (Vector Search)
To generate embeddings, use the `LlamaEmbedding` class. It automatically configures the model for vector generation.
```python
from llama_cpp.llama_embedding import LlamaEmbedding

# Initialize the model (automatically sets embedding=True)
# llm = LlamaEmbedding(...)  # full initialization arguments omitted here

# Returns a similarity matrix (A @ A.T) in the response
# Note: Requires numpy installed
response = llm.create_embedding(
    ["apple", "fruit", "car"],
    output_format="json+"
)
print(response["cosineSimilarity"])
```
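For reference, the `cosineSimilarity` matrix mentioned above is simply the pairwise dot product of the L2-normalized embedding vectors (`A @ A.T`). A minimal numpy sketch of that computation, using made-up vectors in place of real model output:

```python
import numpy as np

# Made-up vectors standing in for the model's embeddings of ["apple", "fruit", "car"]
A = np.array([
    [0.8, 0.6, 0.0],
    [0.7, 0.7, 0.1],
    [0.1, 0.0, 1.0],
])

# L2-normalize each row; A @ A.T is then the cosine similarity matrix
A = A / np.linalg.norm(A, axis=1, keepdims=True)
similarity = A @ A.T
print(similarity)  # similarity[i, j] = cosine similarity of texts i and j
```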
#### 2. Reranking (Cross-Encoder Scoring)
Reranking models (like `bge-reranker`) take a **Query** and a list of **Documents** as input and output a relevance score (scalar) for each document.
> **Important:** You must explicitly set `pooling_type` to `LLAMA_POOLING_TYPE_RANK` (4) when initializing the model.

```python
import llama_cpp
from llama_cpp.llama_embedding import LlamaEmbedding

# Initialization (with pooling_type=llama_cpp.LLAMA_POOLING_TYPE_RANK) and the
# query/documents scoring call are omitted here; `scores` is the resulting list.

# Result: List of floats (higher means more relevant)
print(scores)
# e.g., [-0.15, -8.23, 5.67] -> The 3rd doc is the best match
```
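The scores are plain floats where only the relative order matters (higher means more relevant). A small illustrative snippet for ranking documents by score; the `documents` and `scores` values here are made up:

```python
# Pair each document with its reranker score and sort in descending order
documents = ["doc A", "doc B", "doc C"]
scores = [-0.15, -8.23, 5.67]

ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:+.2f}  {doc}")
```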
#### 3. Normalization
The `embed` method supports various mathematical normalization strategies via the `normalize` parameter.

| Normalization modes | Integer | Description | Formula |
| --- | --- | --- | --- |
| Euclidean (L2), the default | | Standard for vector databases | |
| Max Absolute Int16 | | Useful for quantization/compression | |
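For intuition, here is a rough numpy sketch of what these two modes compute mathematically. This is generic math, not the library's internal implementation, and the int16 interpretation is an assumption:

```python
import numpy as np

vec = np.array([0.3, -1.2, 0.8, 2.4])

# Euclidean (L2): scale the vector to unit length (standard for vector databases)
l2_normalized = vec / np.linalg.norm(vec)

# Max Absolute Int16: scale so the largest magnitude maps to 32767, then cast
# (useful for quantization/compression)
int16_scaled = np.round(vec / np.abs(vec).max() * 32767).astype(np.int16)

print(l2_normalized)
print(int16_scaled)
```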
The standard `Llama` class still supports basic embedding generation, but it lacks the memory optimizations and reranking capabilities of `LlamaEmbedding`. To generate text embeddings with it, use [`create_embedding`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_embedding) or [`embed`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.embed). Note that you must pass `embedding=True` to the constructor upon model creation for these to work properly.

There are two primary notions of embeddings in a Transformer-style model: *token level* and *sequence level*. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token. Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string; non-embedding models, such as those designed for text generation, will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings. It is possible to control pooling behavior in some cases using the `pooling_type` flag on model creation: you can ensure token level embeddings from any model using `LLAMA_POOLING_TYPE_NONE`. The reverse, getting a generation-oriented model to yield sequence level embeddings, is currently not possible, but you can always do the pooling manually.

```python
# Old method - Not recommended for large batches or reranking
import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once:
embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
```
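For example, a minimal sketch (assuming a local GGUF model at `path/to/model.gguf`) of forcing token level embeddings with the standard class:

```python
import llama_cpp

# Disable pooling so embed() returns one vector per input token
llm = llama_cpp.Llama(
    model_path="path/to/model.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE,
)

token_vectors = llm.embed("Hello, world!")        # list of per-token vectors
print(len(token_vectors), len(token_vectors[0]))  # (number of tokens, embedding dimension)
```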
### Speculative Decoding

`llama-cpp-python` supports speculative decoding, which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.

Just pass this as a draft model to the `Llama` class during initialization.

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better for CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
```
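Once initialized this way, the model is used for generation exactly as usual, for example:

```python
output = llama(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```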