Description
Problem Description:
I am attempting to deploy the `multilingual-e5-base` embedding model for local inference on Windows 11 using LocalAI via Docker Compose with NVIDIA GPU acceleration (GTX 1660 SUPER, CUDA 12). Despite configuring the model via a YAML file and manually placing a compatible GGUF file, I encounter inconsistent behavior depending on how the model is referenced in the API call.
- When calling the embeddings API using the model name specified in the YAML (`multilingual-e5-base`), the request fails with a `backend not found` error, specifically referencing `llama-embeddings`.
- When calling the embeddings API directly using the GGUF filename (`multilingual-e5-base-Q8_0.gguf`), the model loads successfully via the `llama-cpp` backend and utilizes the GPU, but the returned embedding vector is consistently empty (`[]`), with logs indicating `embedding disabled`.

This suggests an issue with the integration or routing of the `llama-embeddings` backend within the Docker image builds for CUDA 12, or potentially a parameter-passing issue when using the underlying `llama-cpp` library directly.
Steps to Reproduce:
- Environment Setup:
  - Operating System: Windows 11
  - Docker Desktop installed and running.
  - NVIDIA GPU: GeForce GTX 1660 SUPER
  - NVIDIA Driver: Compatible with CUDA 12 (logs showed CUDA Version: 12.7).
  - LocalAI deployed using Docker Compose.
- `docker-compose.yaml` Configuration:
  - Used a standard `docker-compose.yaml` obtained from the LocalAI GitHub repository.
  - Modified the `image:` to use CUDA 12 compatible tags (tested `master-cublas-cuda12` and `master-aio-gpu-nvidia-cuda-12`). The logs provided below are from `master-aio-gpu-nvidia-cuda-12`.
  - Added `deploy:` section for NVIDIA GPU.
  - Ensured `volumes:` maps `./models` to `/models:cached`.
  - Ensured `environment:` includes `MODELS_PATH=/models` and `DEBUG=true`.
  - Crucially, removed or commented out the default `command:` line.
  - Removed or commented out `DOWNLOAD_MODELS=true`.

  ```yaml
  # Relevant parts of docker-compose.yaml
  services:
    api:
      image: quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12 # Or master-cublas-cuda12
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: 1
                capabilities: [gpu]
      ports:
        - 8080:8080
      environment:
        - MODELS_PATH=/models
        - DEBUG=true
        # - DOWNLOAD_MODELS=true # Removed
      volumes:
        - ./models:/models:cached
      # command: # Removed or commented out
      #   - some-model
  ```
- Model File and Configuration Setup:
  - Manually downloaded the `multilingual-e5-base-Q8_0.gguf` file from https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf.
  - Created the `./models/` directory in the LocalAI project root.
  - Placed the downloaded `multilingual-e5-base-Q8_0.gguf` file in the `./models/` directory.
  - Created the `multilingual-e5-base.yaml` file in the `./models/` directory with the following content:

  ```yaml
  # ./models/multilingual-e5-base.yaml
  name: multilingual-e5-base
  backend: llama-embeddings # Specify backend
  embeddings: true # Mark as embeddings model
  parameters:
    model: multilingual-e5-base-Q8_0.gguf # File name relative to MODELS_PATH
    n_gpu_layers: -1 # Attempt to offload all layers to GPU
    embedding: true # Explicitly set embedding parameter
    f16: true
  ```
- Deploy LocalAI:
  - Open PowerShell in the directory containing `docker-compose.yaml`.
  - Run `docker-compose down`.
  - Run `docker-compose pull <selected_image_tag>`.
  - Run `docker-compose up -d`.
- Attempt Embeddings API Calls: Wait for LocalAI to start (check logs or `/readyz`); a combined sanity-check sketch follows these steps.
  - Attempt 1 (Using YAML name):

    ```powershell
    curl -X POST http://localhost:8080/v1/embeddings `
      -H "Content-Type: application/json" `
      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # Use YAML name
      -v
    ```

  - Attempt 2 (Using GGUF filename):

    ```powershell
    curl -X POST http://localhost:8080/v1/embeddings `
      -H "Content-Type: application/json" `
      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # Use GGUF filename
      -v
    ```

    (Note: Adding `"embeddings": true` to the JSON body in Attempt 2 yielded the same result.)
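Before running the two attempts above, a few quick checks can confirm that the container sees the GPU, that the model files are mounted, and that the API is ready. This is a sketch, assuming the Compose service is named `api` as in the snippet above:

```powershell
docker-compose exec api nvidia-smi      # GPU should be visible inside the container
docker-compose exec api ls -l /models   # both multilingual-e5-base.yaml and the .gguf should be listed
curl -v http://localhost:8080/readyz    # readiness endpoint mentioned in the last step
```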
Expected Behavior:
- Both Attempt 1 and Attempt 2 should return a `200 OK` response with a JSON body containing a `data` array, where each element has a non-empty `embedding` list (the vector), i.e. the same response structure as Curl Output 2 below but with a populated `embedding` array.
- Logs should indicate successful loading and use of the model, preferably utilizing the GPU.
Observed Behavior:
- Attempt 1 (Using YAML name): Returns `500 Internal Server Error` with the message `"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings"`. (See Curl Output 1 below.)
- Attempt 2 (Using GGUF filename): Returns a `200 OK` status, but the `embedding` list in the JSON response is empty (`[]`). (See Curl Output 2 below.) Docker logs show the model is loaded but embedding is disabled.
Environment Information:
- OS: Windows 11
- Docker Desktop Version: (Please specify your version, e.g., 4.29.0)
- GPU: NVIDIA GeForce GTX 1660 SUPER
- NVIDIA Driver Version: (Please specify your driver version)
- CUDA Version (as reported by `nvidia-smi` in logs): 12.7
- LocalAI Docker Image Tags Tested: `quay.io/go-skynet/local-ai:master-cublas-cuda12`, `quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12`, potentially others from `sha-*-cuda12`. All tested tags exhibited the "backend not found" error when using the YAML name.
- LocalAI Version (as reported in logs): `4076ea0` (from the master branch)
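The unspecified versions above can be read off the host with standard tooling; for example (nothing here is LocalAI-specific):

```powershell
nvidia-smi         # driver version and the "CUDA Version: 12.7" figure seen in the logs
docker version     # Docker Desktop engine/client versions
docker images quay.io/go-skynet/local-ai   # locally pulled LocalAI image tags
```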
Relevant Logs:
- Curl Output 1 (Attempt 1 - calling with YAML name):

  ```
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # <-- Use YAML name
  {"error":{"code":500,"message":"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings","type":""}}
  ... (rest of curl -v output showing 500 Internal Server Error) ...
  ```
- Curl Output 2 (Attempt 2 - calling with GGUF filename):

  ```
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # <-- Use GGUF filename
  {"created":1746090262,"object":"list","id":"a4e28026-95c6-46d5-ad7b-3a3ce87a14e5","model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
  ... (rest of curl -v output showing 200 OK) ...
  ```

  (The output is the same when adding `"embeddings": true` to the request body.)
- Docker Logs (Excerpt showing "backend not found" for YAML name call):

  ```
  ... (startup logs) ...
  8:59AM INF Preloading models from /models   # LocalAI finds the YAML and GGUF
  Model name: multilingual-e5-base
  8:59AM DBG Model: multilingual-e5-base (config: {... parameters:{model:multilingual-e5-base-Q8_0.gguf ... Backend:llama-embeddings Embeddings:true ...}})   # Correct config loaded
  ... (user sends curl request with model: "multilingual-e5-base") ...
  8:59AM INF BackendLoader starting backend=llama-embeddings modelID=multilingual-e5-base o.model=multilingual-e5-base-Q8_0.gguf   # Attempting to load via backend name
  8:59AM DBG Loading model in memory from file: /models/multilingual-e5-base-Q8_0.gguf   # Attempting to load file
  8:59AM DBG Loading Model multilingual-e5-base with gRPC (file: /models/multilingual-e5-base-Q8_0.gguf) (backend: llama-embeddings): {...}
  8:59AM ERR Server error error="failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings" ip=172.19.0.1 latency=2m22.975112253s method=POST status=500 url=/v1/embeddings   # Backend executable not found
  ...
  ```
- Docker Logs (Excerpt showing model loaded but embedding disabled for GGUF filename call):

  ```
  ... (user sends curl request with model: "multilingual-e5-base-Q8_0.gguf") ...
  9:04AM DBG Model file loaded: multilingual-e5-base-Q8_0.gguf architecture=bert bosTokenID=0 eosTokenID=2 modelName=   # File identified
  ...
  9:04AM INF Trying to load the model 'multilingual-e5-base-Q8_0.gguf' with the backend '[llama-cpp llama-cpp-fallback ...]'   # Tries multiple backends, including llama-cpp
  9:04AM INF [llama-cpp] Attempting to load
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) - 5134 MiB free   # GPU detected and used
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_loader: loaded meta data with 35 key-value pairs ... from /models/multilingual-e5-base-Q8_0.gguf (version GGUF V3 (latest))   # GGUF loaded successfully
  ...
  9:04AM INF [llama-cpp] Loads OK   # Model loaded successfully by llama-cpp
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stdout {"timestamp":...,"level":"WARNING","function":"send_embedding","line":1368,"message":"embedding disabled","params.embedding":false}   # Embedding is explicitly disabled
  ...
  9:04AM DBG Response: {"created":...,"object":"list","id":...,"model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{...}}   # Empty embedding returned
  ...
  ```
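For reference, the excerpts above come from the container logs with `DEBUG=true` set; something along these lines reproduces them (assuming the Compose service name `api` from the snippet earlier):

```powershell
# Stream LocalAI logs and surface the embedding-related lines.
docker-compose logs -f api | Select-String "embedding"
```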
Additional Context:
- The `text-embedding-ada-002` model, which also uses the `llama-cpp` backend (based on its YAML configuration in LocalAI's AIO image), successfully loads and returns embedding vectors using the same LocalAI Docker image and the `/v1/embeddings` endpoint. This confirms that the core `llama-cpp` library and the general embeddings functionality are working correctly within the container and with the GPU (a sketch of such a comparison call follows this list).
- This issue seems specific to how the `multilingual-e5-base` model (perhaps due to its architecture being "bert", as shown in the logs, or differences in its GGUF structure) interacts with LocalAI's `llama-embeddings` backend abstraction, or how parameters (like `embeddings: true`) are passed to `llama-cpp` in different loading scenarios.
- I have tried different CUDA 12 master branch tags (`master-cublas-cuda12`, `master-aio-gpu-nvidia-cuda-12`), and they all exhibit the same "backend not found" error when calling by YAML name.
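A minimal sketch of the comparison call mentioned in the first bullet, assuming the same request shape as Attempt 1 (the exact payload used for the `text-embedding-ada-002` check is not recorded above):

```powershell
curl -X POST http://localhost:8080/v1/embeddings `
  -H "Content-Type: application/json" `
  -d '{"input": "这是一个测试句子。", "model": "text-embedding-ada-002"}' `
  -v
# Per the bullet above, this returns a populated "embedding" array, unlike Attempt 2.
```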
This detailed information should help the LocalAI developers diagnose the specific issue within their build or model loading logic for `llama-embeddings` with this type of model/GGUF.