Description
Title:
🔧 Only Stable Diffusion Uses GPU – LLM and Embedding Models Run on CPU in LocalAGI Stack
Issue Description:
When running the full LocalAGI stack using the provided `docker-compose.nvidia.yaml`, only Stable Diffusion models make use of the GPU. In contrast, LLM (e.g., Gemma) and embedding (e.g., Granite) models run entirely on the CPU, despite launching the stack with GPU support.
This leads to an inconsistent experience where Stable Diffusion responses are fast and GPU-accelerated, while LLM chats and embedding operations are significantly slower and CPU-bound.
💻 Reproduction Steps

1. Clone the LocalAGI repo and start the stack:

   ```
   docker compose -f docker-compose.nvidia.yaml up -d
   ```

2. Open the web UI or call the OpenAI-compatible API endpoints for:
   - Chat: `POST /v1/chat/completions`
   - Embedding: `POST /v1/embeddings`
   - Image generation: `POST /v1/images/generations`

3. Monitor GPU usage:

   ```
   nvidia-smi
   ```
📊 Observed Behavior

| Action | Result |
|---|---|
| Generating images via Stable Diffusion | ✅ GPU is fully utilized |
| Sending chat requests to LLMs like Gemma | ❌ CPU only, no GPU activity |
| Requesting embeddings (e.g., Granite) | ❌ CPU only, no GPU activity |

In all non-image cases, responses are noticeably slower and the GPU shows no load.
📝 Additional Notes

- This issue is specific to the LocalAGI compose stack. When I run the same LocalAI image independently via:

  ```
  docker run --gpus all -p 8081:8080 localai/localai:latest-gpu-nvidia-cuda-12
  ```

  ...the GPU is correctly used for LLMs, embeddings, and Stable Diffusion.
- The container logs do not indicate any errors. Models load successfully and inference works; it just stays on the CPU for all non-image models.
- This behavior is consistent across restarts, GPU driver reinstalls, and even after re-pulling the Docker images.
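One thing worth checking, since the standalone `docker run --gpus all` invocation works: whether the compose file actually requests the GPU for the service that serves LLMs and embeddings. A minimal sketch of the standard Compose GPU reservation follows; the service name `localai` is an assumption (it may differ in the actual `docker-compose.nvidia.yaml`), and the image tag is taken from the standalone command above:

```yaml
# Hypothetical excerpt — service name "localai" is assumed, not confirmed
# against the actual docker-compose.nvidia.yaml in the LocalAGI repo.
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If this reservation (or an equivalent `gpus`/`runtime: nvidia` setting) is present only on the Stable Diffusion service and not on the service handling chat and embeddings, that would explain the CPU-only behavior observed here.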
📌 Summary
Only Stable Diffusion models are GPU-accelerated when using the LocalAGI stack. LLMs and embedding models run entirely on CPU even when the container has access to the GPU. The cause of this inconsistency is unclear, but it leads to significant performance differences and an inconsistent runtime environment across model types.
Would appreciate clarification on whether this is expected behavior or a misconfiguration somewhere in the default stack.