Description
Title:
🔧 Only Stable Diffusion Uses GPU – LLM and Embedding Models Run on CPU in LocalAGI Stack
Issue Description:
When running the full LocalAGI stack using the provided `docker-compose.nvidia.yaml`, only Stable Diffusion models make use of the GPU. In contrast, LLM (e.g., Gemma) and embedding (e.g., Granite) models run entirely on the CPU, despite launching the stack with GPU support.
This leads to an inconsistent experience where Stable Diffusion responses are fast and GPU-accelerated, while LLM chats and embedding operations are significantly slower and CPU-bound.
💻 Reproduction Steps

1. Clone the LocalAGI repo and start the stack:

   ```
   docker compose -f docker-compose.nvidia.yaml up -d
   ```

2. Open the web UI or call the OpenAI-compatible API endpoints for:
   - Chat: `POST /v1/chat/completions`
   - Embedding: `POST /v1/embeddings`
   - Image generation: `POST /v1/images/generations`

3. Monitor GPU usage:

   ```
   nvidia-smi
   ```
📊 Observed Behavior

| Action | Result |
|---|---|
| Generating images via Stable Diffusion | ✅ GPU is fully utilized |
| Sending chat requests to LLMs like Gemma | ❌ CPU only, no GPU activity |
| Requesting embeddings (e.g., Granite) | ❌ CPU only, no GPU activity |

In all non-image cases, responses are noticeably slower and the GPU shows no load.
📝 Additional Notes

- This issue is specific to the LocalAGI compose stack. When I run the same LocalAI image independently via:

  ```
  docker run --gpus all -p 8081:8080 localai/localai:latest-gpu-nvidia-cuda-12
  ```

  ...the GPU is correctly used for LLMs, embeddings, and Stable Diffusion.
- The container logs do not indicate any errors. Models load successfully and inference works; it just stays on the CPU for all non-image models.
- This behavior is consistent across restarts, GPU driver reinstalls, and even after re-pulling the Docker images.
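One thing worth checking, since the standalone `docker run --gpus all` invocation works: whether the compose file actually requests the GPU for the service that serves LLMs and embeddings. A minimal sketch of the standard Compose GPU reservation follows; the service name `localai` is an assumption (it may differ in the actual `docker-compose.nvidia.yaml`), and the image tag is taken from the standalone command above:

```yaml
# Hypothetical excerpt — service name "localai" is assumed, not confirmed
# against the actual docker-compose.nvidia.yaml in the LocalAGI repo.
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If this reservation (or an equivalent `gpus`/`runtime: nvidia` setting) is present only on the Stable Diffusion service and not on the service handling chat and embeddings, that would explain the CPU-only behavior observed here.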
📌 Summary
Only Stable Diffusion models are GPU-accelerated when using the LocalAGI stack. LLMs and embedding models run entirely on CPU even when the container has access to the GPU. The cause of this inconsistency is unclear, but it leads to significant performance differences and an inconsistent runtime environment across model types.
Would appreciate clarification on whether this is expected behavior or a misconfiguration somewhere in the default stack.