tutorial : compute embeddings using llama.cpp #7712
Replies: 6 comments 1 reply
-
Thank you for this tutorial. It was surprising to find it via a search engine since it is only a few hours old. On my end I run it like this:

curl -s -X POST https://example.com/embedding \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "phi-3-mini-4k-instruct",
"content": "The Enigmatic Scholar: A mysterious LLM who speaks in riddles and always leaves breadcrumbs of knowledge for others to unravel. They delight in posing cryptic questions and offering enigmatic clues to guide others on their intellectual quests.",
"encoding_format": "float"
}' | jq .

This command produces the following output:

{
"embedding": [
0.02672094851732254,
0.0065623000264167786,
0.011766364797949791,
0.028863387182354927,
0.018085993826389313,
-0.008007422089576721,
// (...snipped...)
-0.0014697747537866235,
-0.004578460939228535,
-0.0034472437109798193,
-0.01315175462514162
]
}

Question: how do I use this directly in the inference process (if that is possible)? So far I have tested it like this, but to no avail:

curl -s https://example.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-d '{
"model": "phi-3-mini-4k-instruct",
"messages": [
{
"role": "system",
"content": "Act as a concise, helpful assistant. Avoid summaries, disclaimers, and apologies."
},
{
"role": "user",
"content": "introduce yourself"
},
{
"role": "context",
"content": {
"embedding": [-0.0034904987551271915,0.0014886681456118822,-0.03103388287127018,0.0131469015032053,(...snip...),0.022104227915406227]
}
}
],
"stream": false,
"max_tokens": 50,
"temperature": 0.7,
"top_k": 40,
"top_p": 1.0,
"min_p": 0.05000000074505806,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [""],
"stream": false
}' | jq .

This is the output:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "",
"role": "assistant"
}
}
],
"created": 1717482034,
"model": "phi-3-mini-4k-instruct",
"object": "chat.completion",
"usage": {
"completion_tokens": 1,
"prompt_tokens": 59,
"total_tokens": 60
},
"id": "chatcmpl-z2r3sxoTcHJTQYPCPftDmtf6Tev4zhmz",
"__verbose": {
"content": "",
"id_slot": 0,
"stop": true,
"model": "phi-3-mini-4k-instruct",
"tokens_predicted": 1,
"tokens_evaluated": 59,
"generation_settings": {
"n_ctx": 4096,
"n_predict": -1,
"model": "phi-3-mini-4k-instruct",
"seed": 4294967295,
"temperature": 0.699999988079071,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 1.0,
"min_p": 0.05000000074505806,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"penalty_prompt_tokens": [],
"use_penalty_prompt_tokens": false,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [
""
],
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"samplers": [
"top_k",
"tfs_z",
"typical_p",
"top_p",
"min_p",
"temperature"
]
},
"prompt": "<|system|>\nAct as a concise, helpful assistant. Avoid summaries, disclaimers, and apologies.<|end|>\n<|user|>\nintroduce yourself<|end|>\n<|context|>\n<|end|>\n<|assistant|>\n",
"truncated": false,
"stopped_eos": false,
"stopped_word": true,
"stopped_limit": false,
"stopping_word": "",
"tokens_cached": 59,
"timings": {
"prompt_n": 59,
"prompt_ms": 737.838,
"prompt_per_token_ms": 12.50572881355932,
"prompt_per_second": 79.96335238900681,
"predicted_n": 1,
"predicted_ms": 0.732,
"predicted_per_token_ms": 0.732,
"predicted_per_second": 1366.120218579235
},
"oaicompat_token_ctr": 1
}
}

I tried to do a similar thing in open-webui, which succeeded, but I wish it were possible to do this through the llama.cpp server API directly. To set up llama.cpp with open-webui, these are the rough step-by-step instructions:
...
) else if %choice%==2 (
set model_path=models\Phi-3-mini-4k-instruct-Q4_K_M.gguf
set model_alias=phi-3-mini-4k-instruct
set model_sysprompt=models\prompt_default.json
set context_length=4096
set api_key=testingonly
set gpu_offload_layer=33
...
%server_exe% --verbose --model %model_path% -a %model_alias% -ngl %gpu_offload_layer% --host %host% --port %port% --api-key %api_key% -c %context_length% --system-prompt-file %model_sysprompt% --embeddings --metrics --slots-endpoint-disable
...
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - ./webui-data:/app/backend/data
    ports:
      - 172.17.0.1:8009:8080
    environment:
      - 'OLLAMA_BASE_URL=http://tailscale2-ollama:8820'
      - 'WEBUI_SECRET_KEY=changeme'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped

Run the compose file. While it runs, these are the logs that were produced in open-webui:
I'm still pondering this discussion, and for now my conclusion is that it is not yet possible to use embeddings directly during inference in llama.cpp (happy to be proven wrong, of course). I feel like my approach here is wrong, but my rubber duck isn't providing more help for now. Do I really need embeddings, or is RAG the technical term for what I want? Or do I just create ...
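If RAG is indeed what I need, I suppose the flow would look roughly like the sketch below: the similarity search over stored embeddings happens on my side (which is what open-webui does for me), and only the retrieved text is sent to the chat endpoint, since /v1/chat/completions takes plain-text system/user/assistant messages rather than embedding vectors. The RETRIEVED variable and the wording are just placeholders.

# rough sketch, not a working RAG pipeline:
# 1) embed the documents and the question via /embedding
# 2) pick the stored chunk with the highest cosine similarity (client side)
# 3) paste that chunk as plain text into the prompt
RETRIEVED="The Enigmatic Scholar: A mysterious LLM who speaks in riddles..."

curl -s https://example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -d "{
    \"model\": \"phi-3-mini-4k-instruct\",
    \"messages\": [
      {\"role\": \"system\", \"content\": \"Answer using the following context.\\nContext: $RETRIEVED\"},
      {\"role\": \"user\", \"content\": \"introduce yourself\"}
    ],
    \"max_tokens\": 50
  }" | jq .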
-
I tried Jina with the following command:
This successfully returned the embedding. However, when I tried to use the following command:
I got a segmentation fault.
I'm on tag b3482 (commit e54c35e).
-
@ggerganov Thank you for the tutorial! Is it possible to use https://huggingface.co/Qdrant/all_miniLM_L6_v2_with_attentions/tree/main? For some reason I am getting an "unable to load the model" error.
-
It worked! I did it in Python:

import requests
session = requests.Session()
session.headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
}
health = session.request(
method='get',
url="http://localhost:8080/health")
print(health.text)
payload = {
'content': '42 is the answer to the ultimate question of life, the universe, and everything'
}
response = session.request(
method='post',
url='http://localhost:8080/embedding',
json=payload)
print(response.text)

Thank you for all your hard work on this repository. This has allowed me to understand open-source LLMs better and keep up with the current trends. Please don't stop what you guys are doing <3!
-
Is there a way to import torch on a RISC-V Ubuntu platform, so as to run llama-embedding on RISC-V?
-
Is it possible to reverse embeddings with llama.cpp? For example, trying the famous "king - man + woman ≈ queen" case:
How do you go from the resulting vector back to the text "queen"? Edit: Not possible, except in very trivial circumstances: discussion
-
Overview
This is a short guide for running embedding models such as BERT using llama.cpp. We obtain and build the latest version of the llama.cpp software and use the examples to compute basic text embeddings and perform a speed benchmark.
Instructions
Obtain and build the latest llama.cpp
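For example, something along these lines (a rough sketch; CMake options and paths may differ on your system, and older builds name the tools without the llama- prefix):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# on recent versions the tools (llama-embedding, llama-server, ...) end up in build/bin/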
Download the embedding model from HF
In this tutorial, we use the following model: https://huggingface.co/Snowflake/snowflake-arctic-embed-s
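One way to fetch the model files (assuming git-lfs is installed):

git lfs install
git clone https://huggingface.co/Snowflake/snowflake-arctic-embed-s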
Convert the model to GGUF file format
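A sketch of the conversion step, using the converter script that ships with llama.cpp (depending on the checkout it is named convert-hf-to-gguf.py or convert_hf_to_gguf.py):

# writes an f16 GGUF next to the current directory; output name chosen to match the listing below
python convert_hf_to_gguf.py snowflake-arctic-embed-s --outfile model-f16.gguf --outtype f16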
Quantize the model (optional)
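For example, to produce the q8_0 file shown in the listing below (older builds call the binary just quantize):

./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf q8_0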
$ ▶ ls -l model-*
-rw-r--r--  1 ggerganov  staff  67579232 Jun  3 14:21 model-f16.gguf
-rw-r--r--  1 ggerganov  staff  36684768 Jun  3 14:22 model-q8_0.gguf
Run basic embedding test
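Something like the following (a sketch; older builds name the binary embedding):

./build/bin/llama-embedding -m model-f16.gguf -p "Hello world" -ngl 99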
The -ngl 99 argument specifies to offload 99 layers of the model (i.e. the entire model) to the GPU. Use -ngl 0 for CPU-only computation.
Run speed benchmark for different input sizes
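A crude stand-in for such a benchmark, which simply times llama-embedding on inputs of increasing length (the repeated word and the sizes are arbitrary):

# crude stand-in, not an official benchmark script
for n in 16 64 256 512; do
  # build an input of roughly n words and time the embedding run
  prompt=$(printf 'hello %.0s' $(seq 1 $n))
  echo "input words: $n"
  time ./build/bin/llama-embedding -m model-f16.gguf -ngl 99 -p "$prompt" > /dev/null
done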
Start an HTTP server
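For example (the flag is spelled --embedding in some versions and --embeddings in others):

./build/bin/llama-server -m model-f16.gguf --embeddings -ngl 99 -c 512 --port 8080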
The maximum input size is 512 tokens. We can use curl to send queries to the server:
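For example, assuming the server from the previous step is listening on port 8080:

curl -s http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello world"}' | jq .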