LocalAI version:
image: quay.io/go-skynet/local-ai:master-cublas-cuda11
Environment, CPU architecture, OS, and Version:
Kubernetes deployment with the above image. The underlying node has an AMD EPYC Milan CPU (16 cores) and an NVIDIA A4500 GPU.
Describe the bug
I start the server, then begin sending one request at a time to it. After hundreds of successful inferences, one request blocks at the message below. The GPU then sits above 94% utilization at full power indefinitely, and all subsequent requests time out on the OpenAI-client side. The requests are all the same size and type, and the prompts it gets stuck on are often extremely short and simple, with no special characters.
It is hard to predict when this will happen, but it always does eventually: sometimes within an hour, other times only after 12 hours.
I am limiting the number of in-flight requests to 1 in my client.
8:02AM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
8:02AM DBG Stopping all backends except 'wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin'
8:02AM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
To Reproduce
Here is my model YAML:
backend: llama-stable
context_size: 4096
batch: 512
threads: 1
f16: true
gpu_layers: 43
mmlock: true
name: wizard13B_gpu
parameters:
  model: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
  temperature: 0.2
  top_p: 0.9
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
stopwords:
  - "USER:"
  - "</s>"
template:
  chat: wizard_chat
  completion: wizard_completion
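
For reference, here is a minimal sketch of the kind of client loop that reproduces this, assuming LocalAI's OpenAI-compatible /v1/chat/completions endpoint on the default port; the base URL, prompt, timeout, and request count are placeholders, not the exact values from my client:

import requests

BASE_URL = "http://localhost:8080"   # placeholder; adjust for your Service/ingress

PAYLOAD = {
    "model": "wizard13B_gpu",        # matches the `name:` field in the YAML above
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this sentence in five words."},
    ],
    "temperature": 0.2,
    "top_p": 0.9,
}

for i in range(10_000):
    try:
        # Strictly serial: each call blocks until it returns or the
        # client-side timeout fires, so only one request is ever in flight.
        r = requests.post(f"{BASE_URL}/v1/chat/completions", json=PAYLOAD, timeout=120)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        # Once one request hangs, the GPU stays pinned and every
        # subsequent request times out as well.
        print(f"request {i} timed out")
        break

The loop is strictly serial, so there is never more than one request in flight when the hang occurs.
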
Expected behavior
I would expect it to keep operating as it does for the first few hundred requests.
Logs
8:02AM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
8:02AM DBG Stopping all backends except 'wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin'
8:02AM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
Additional context