Description
Problem Description:
I am attempting to deploy the `multilingual-e5-base` embedding model for local inference on Windows 11 using LocalAI via Docker Compose with NVIDIA GPU acceleration (GTX 1660 SUPER, CUDA 12). Despite configuring the model via a YAML file and manually placing a compatible GGUF file, I encounter inconsistent behavior depending on how the model is referenced in the API call.
- When calling the embeddings API using the model name specified in the YAML (`multilingual-e5-base`), the request fails with a `backend not found` error, specifically referencing `llama-embeddings`.
- When calling the embeddings API directly using the GGUF filename (`multilingual-e5-base-Q8_0.gguf`), the model loads successfully via the `llama-cpp` backend and utilizes the GPU, but the returned embedding vector is consistently empty (`[]`), with logs indicating `embedding disabled`.

This suggests an issue with the integration or routing of the `llama-embeddings` backend within the Docker image builds for CUDA 12, or potentially a parameter-passing issue when using the underlying `llama-cpp` library directly.
Steps to Reproduce:
- Environment Setup:
  - Operating System: Windows 11
  - Docker Desktop installed and running.
  - NVIDIA GPU: GeForce GTX 1660 SUPER
  - NVIDIA Driver: Compatible with CUDA 12 (logs showed CUDA Version: 12.7).
  - LocalAI deployed using Docker Compose.
- `docker-compose.yaml` Configuration:
  - Used a standard `docker-compose.yaml` obtained from the LocalAI GitHub repository.
  - Modified the `image:` to use CUDA 12 compatible tags (tested `master-cublas-cuda12` and `master-aio-gpu-nvidia-cuda-12`). The logs provided below are from `master-aio-gpu-nvidia-cuda-12`.
  - Added `deploy:` section for NVIDIA GPU.
  - Ensured `volumes:` maps `./models` to `/models:cached`.
  - Ensured `environment:` includes `MODELS_PATH=/models` and `DEBUG=true`.
  - Crucially, removed or commented out the default `command:` line.
  - Removed or commented out `DOWNLOAD_MODELS=true`.

  ```yaml
  # Relevant parts of docker-compose.yaml
  services:
    api:
      image: quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12 # Or master-cublas-cuda12
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: 1
                capabilities: [gpu]
      ports:
        - 8080:8080
      environment:
        - MODELS_PATH=/models
        - DEBUG=true
        # - DOWNLOAD_MODELS=true # Removed
      volumes:
        - ./models:/models:cached
      # command: # Removed or commented out
      #   - some-model
  ```
- Model File and Configuration Setup:
  - Manually downloaded the `multilingual-e5-base-Q8_0.gguf` file from https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf.
  - Created the `./models/` directory in the LocalAI project root.
  - Placed the downloaded `multilingual-e5-base-Q8_0.gguf` file in the `./models/` directory.
  - Created the `multilingual-e5-base.yaml` file in the `./models/` directory with the following content:

  ```yaml
  # ./models/multilingual-e5-base.yaml
  name: multilingual-e5-base
  backend: llama-embeddings # Specify backend
  embeddings: true # Mark as embeddings model
  parameters:
    model: multilingual-e5-base-Q8_0.gguf # File name relative to MODELS_PATH
    n_gpu_layers: -1 # Attempt to offload all layers to GPU
    embedding: true # Explicitly set embedding parameter
    f16: true
  ```
- Deploy LocalAI:
  - Open PowerShell in the directory containing `docker-compose.yaml`.
  - Run `docker-compose down`.
  - Run `docker-compose pull <selected_image_tag>`.
  - Run `docker-compose up -d`.
- Attempt Embeddings API Calls: Wait for LocalAI to start (check logs or `/readyz`); a combined sanity-check sketch follows these steps.
  - Attempt 1 (Using YAML name):

    ```powershell
    curl -X POST http://localhost:8080/v1/embeddings `
      -H "Content-Type: application/json" `
      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # Use YAML name
      -v
    ```

  - Attempt 2 (Using GGUF filename):

    ```powershell
    curl -X POST http://localhost:8080/v1/embeddings `
      -H "Content-Type: application/json" `
      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # Use GGUF filename
      -v
    ```

    (Note: Adding `"embeddings": true` to the JSON body in Attempt 2 yielded the same result.)
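Before running the two attempts above, a few quick checks can confirm that the container sees the GPU, that the model files are mounted, and that the API is ready. This is a sketch, assuming the Compose service is named `api` as in the snippet above:

```powershell
docker-compose exec api nvidia-smi      # GPU should be visible inside the container
docker-compose exec api ls -l /models   # both multilingual-e5-base.yaml and the .gguf should be listed
curl -v http://localhost:8080/readyz    # readiness endpoint mentioned in the last step
```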
Expected Behavior:
- Both Attempt 1 and Attempt 2 should return a `200 OK` response with a JSON body containing a `data` array, where each element has a non-empty `embedding` list (the vector), i.e. the same response structure as Curl Output 2 below but with a populated `embedding` array.
- Logs should indicate successful loading and use of the model, preferably utilizing the GPU.
Observed Behavior:
- Attempt 1 (Using YAML name): Returns `500 Internal Server Error` with the message `"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings"`. (See Curl Output 1 below.)
- Attempt 2 (Using GGUF filename): Returns a `200 OK` status, but the `embedding` list in the JSON response is empty (`[]`). (See Curl Output 2 below.) Docker logs show the model is loaded but embedding is disabled.
Environment Information:
- OS: Windows 11
- Docker Desktop Version: (Please specify your version, e.g., 4.29.0)
- GPU: NVIDIA GeForce GTX 1660 SUPER
- NVIDIA Driver Version: (Please specify your driver version)
- CUDA Version (as reported by `nvidia-smi` in logs): 12.7
- LocalAI Docker Image Tags Tested: `quay.io/go-skynet/local-ai:master-cublas-cuda12`, `quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12`, potentially others from `sha-*-cuda12`. All tested tags exhibited the "backend not found" error when using the YAML name.
- LocalAI Version (as reported in logs): `4076ea0` (from the master branch)
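The unspecified versions above can be read off the host with standard tooling; for example (nothing here is LocalAI-specific):

```powershell
nvidia-smi         # driver version and the "CUDA Version: 12.7" figure seen in the logs
docker version     # Docker Desktop engine/client versions
docker images quay.io/go-skynet/local-ai   # locally pulled LocalAI image tags
```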
Relevant Logs:
- Curl Output 1 (Attempt 1 - calling with YAML name):

  ```
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # <-- Use YAML name
  {"error":{"code":500,"message":"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings","type":""}}
  ... (rest of curl -v output showing 500 Internal Server Error) ...
  ```
- Curl Output 2 (Attempt 2 - calling with GGUF filename):

  ```
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # <-- Use GGUF filename
  {"created":1746090262,"object":"list","id":"a4e28026-95c6-46d5-ad7b-3a3ce87a14e5","model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
  ... (rest of curl -v output showing 200 OK) ...
  ```

  (The output is the same when adding `"embeddings": true` to the request body.)
- Docker Logs (Excerpt showing "backend not found" for YAML name call):

  ```
  ... (startup logs) ...
  8:59AM INF Preloading models from /models   # LocalAI finds the YAML and GGUF
  Model name: multilingual-e5-base
  8:59AM DBG Model: multilingual-e5-base (config: {... parameters:{model:multilingual-e5-base-Q8_0.gguf ... Backend:llama-embeddings Embeddings:true ...}})   # Correct config loaded
  ... (user sends curl request with model: "multilingual-e5-base") ...
  8:59AM INF BackendLoader starting backend=llama-embeddings modelID=multilingual-e5-base o.model=multilingual-e5-base-Q8_0.gguf   # Attempting to load via backend name
  8:59AM DBG Loading model in memory from file: /models/multilingual-e5-base-Q8_0.gguf   # Attempting to load file
  8:59AM DBG Loading Model multilingual-e5-base with gRPC (file: /models/multilingual-e5-base-Q8_0.gguf) (backend: llama-embeddings): {...}
  8:59AM ERR Server error error="failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings" ip=172.19.0.1 latency=2m22.975112253s method=POST status=500 url=/v1/embeddings   # Backend executable not found
  ...
  ```
- Docker Logs (Excerpt showing model loaded but embedding disabled for GGUF filename call):

  ```
  ... (user sends curl request with model: "multilingual-e5-base-Q8_0.gguf") ...
  9:04AM DBG Model file loaded: multilingual-e5-base-Q8_0.gguf architecture=bert bosTokenID=0 eosTokenID=2 modelName=   # File identified
  ...
  9:04AM INF Trying to load the model 'multilingual-e5-base-Q8_0.gguf' with the backend '[llama-cpp llama-cpp-fallback ...]'   # Tries multiple backends, including llama-cpp
  9:04AM INF [llama-cpp] Attempting to load
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) - 5134 MiB free   # GPU detected and used
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_loader: loaded meta data with 35 key-value pairs ... from /models/multilingual-e5-base-Q8_0.gguf (version GGUF V3 (latest))   # GGUF loaded successfully
  ...
  9:04AM INF [llama-cpp] Loads OK   # Model loaded successfully by llama-cpp
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stdout {"timestamp":...,"level":"WARNING","function":"send_embedding","line":1368,"message":"embedding disabled","params.embedding":false}   # Embedding is explicitly disabled
  ...
  9:04AM DBG Response: {"created":...,"object":"list","id":...,"model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{...}}   # Empty embedding returned
  ...
  ```
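For reference, the excerpts above come from the container logs with `DEBUG=true` set; something along these lines reproduces them (assuming the Compose service name `api` from the snippet earlier):

```powershell
# Stream LocalAI logs and surface the embedding-related lines.
docker-compose logs -f api | Select-String "embedding"
```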
Additional Context:
- The `text-embedding-ada-002` model, which also uses the `llama-cpp` backend (based on its YAML configuration in LocalAI's AIO image), successfully loads and returns embedding vectors using the same LocalAI Docker image and the `/v1/embeddings` endpoint. This confirms that the core `llama-cpp` library and the general embeddings functionality are working correctly within the container and with the GPU (a sketch of such a comparison call follows this list).
- This issue seems specific to how the `multilingual-e5-base` model (perhaps due to its architecture being "bert", as shown in the logs, or differences in its GGUF structure) interacts with LocalAI's `llama-embeddings` backend abstraction, or how parameters (like `embeddings: true`) are passed to `llama-cpp` in different loading scenarios.
- I have tried different CUDA 12 master branch tags (`master-cublas-cuda12`, `master-aio-gpu-nvidia-cuda-12`), and they all exhibit the same "backend not found" error when calling by YAML name.
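A minimal sketch of the comparison call mentioned in the first bullet, assuming the same request shape as Attempt 1 (the exact payload used for the `text-embedding-ada-002` check is not recorded above):

```powershell
curl -X POST http://localhost:8080/v1/embeddings `
  -H "Content-Type: application/json" `
  -d '{"input": "这是一个测试句子。", "model": "text-embedding-ada-002"}' `
  -v
# Per the bullet above, this returns a populated "embedding" array, unlike Attempt 2.
```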
This detailed information should help the LocalAI developers diagnose the specific issue within their build or model loading logic for `llama-embeddings` with this type of model/GGUF.