
Only Stable Diffusion Uses GPU; LLM and Embedding Models Run on CPU in LocalAGI Stack #177

Closed
@neovasky

Description




Issue Description:

When running the full LocalAGI stack using the provided docker-compose.nvidia.yaml, only Stable Diffusion models make use of the GPU. In contrast, LLM (e.g., Gemma) and embedding (e.g., Granite) models run entirely on the CPU, despite launching the stack with GPU support.

This leads to an inconsistent experience where Stable Diffusion responses are fast and GPU-accelerated, while LLM chats and embedding operations are significantly slower and CPU-bound.
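
A quick way to check whether the GPU is visible inside the stack's LocalAI container at all is to run nvidia-smi through docker compose exec. The service name local-ai below is an assumption; substitute whatever name docker-compose.nvidia.yaml actually uses:

    # List the stack's services to find the actual LocalAI service name
    docker compose -f docker-compose.nvidia.yaml ps

    # Run nvidia-smi inside that container ("local-ai" is a placeholder name)
    docker compose -f docker-compose.nvidia.yaml exec local-ai nvidia-smi

    # If nvidia-smi is missing from the image, the NVIDIA runtime variables
    # are another hint about whether the GPU was passed through
    docker compose -f docker-compose.nvidia.yaml exec local-ai env | grep -i nvidia

If nvidia-smi fails or shows no devices inside the container, the problem is GPU passthrough; if it does show the GPU, the problem is more likely in how the non-image backends are built or configured.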


💻 Reproduction Steps

  1. Clone the LocalAGI repo and start the stack using:

    docker compose -f docker-compose.nvidia.yaml up -d
    
  2. Open the web UI or call the following OpenAI-compatible API endpoints (example curl requests are shown after these steps):

    • Chat: POST /v1/chat/completions

    • Embedding: POST /v1/embeddings

    • Image generation: POST /v1/images/generations

  3. Monitor GPU usage using:

    nvidia-smi
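
For completeness, the requests in step 2 look roughly like the following, with GPU load watched in a second terminal. The host port (8080) and the model names are placeholders and may differ from the stack's actual configuration:

    # Watch GPU utilization in a second terminal while sending requests
    watch -n 1 nvidia-smi

    # Chat completion (LLM); "gemma" is a placeholder model name
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "gemma", "messages": [{"role": "user", "content": "Hello"}]}'

    # Embeddings; "granite-embedding" is a placeholder model name
    curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"model": "granite-embedding", "input": "Hello world"}'

    # Image generation (the only request that produces GPU load)
    curl http://localhost:8080/v1/images/generations \
      -H "Content-Type: application/json" \
      -d '{"prompt": "a red apple on a table", "size": "256x256"}'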
    

📌 Observed Behavior

Action                                    | Result
------------------------------------------|-------------------------------
Generating images via Stable Diffusion    | ✅ GPU is fully utilized
Sending chat requests to LLMs like Gemma  | ❌ CPU only, no GPU activity
Requesting embeddings (e.g., Granite)     | ❌ CPU only, no GPU activity

In all non-image cases, the response is noticeably slower and the GPU shows no load.


📊 Additional Notes

  • This issue is specific to the LocalAGI compose stack (a diagnostic comparison is sketched after this list). When I run the same LocalAI image independently via:

    docker run --gpus all -p 8081:8080 localai/localai:latest-gpu-nvidia-cuda-12
    

    ...the GPU is correctly used for LLMs, embeddings, and Stable Diffusion.

  • The container logs do not indicate any errors. Models load successfully, and inference works; it simply stays on the CPU for all non-image models.

  • This behavior is consistent across restarts, GPU driver reinstalls, and even after re-pulling the Docker images.
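
Since the standalone docker run --gpus all works, one thing worth comparing is whether the rendered compose configuration actually attaches the GPU to the service(s) that handle LLM and embedding inference. A rough check (it is not obvious from the outside whether the stack uses one LocalAI service or several, so the grep is deliberately broad):

    # Print the fully rendered stack config and look for GPU grants
    docker compose -f docker-compose.nvidia.yaml config | grep -iE -B 2 -A 6 'nvidia|capabilities'

    # A service with GPU access would typically carry a block like:
    #   deploy:
    #     resources:
    #       reservations:
    #         devices:
    #           - driver: nvidia
    #             count: all
    #             capabilities: [gpu]

If only the image-generation path carries such a reservation, that would explain the CPU-only behavior for the LLM and embedding models.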


📎 Summary

Only Stable Diffusion models are GPU-accelerated when using the LocalAGI stack. LLMs and embedding models run entirely on the CPU even though the container appears to have access to the GPU. The cause is unclear, but it results in significant performance differences across model types within the same stack.

Would appreciate clarification on whether this is expected behavior or a misconfiguration somewhere in the default stack.
