LocalAI version:
image: quay.io/go-skynet/local-ai:master-cublas-cuda11
Environment, CPU architecture, OS, and Version:
Kubernetes deployment with the above image. The underlying node has an AMD EPYC Milan CPU (16 cores) and an NVIDIA A4500 GPU.
Describe the bug
I start the server, then begin sending one request at a time to it. After hundreds of successful inferences, one request blocks at the message below. The GPU then sits above 94% utilization at full power indefinitely, and all subsequent requests time out on the OpenAI-client side. The requests are all the same size and type, and the prompts it gets stuck on are often extremely short and simple, with no special characters.
It is hard to predict when this will happen, but it always does eventually: sometimes within an hour, other times only after 12 hours.
I am limiting the number of in-flight requests to 1 in my client.
8:02AM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
8:02AM DBG Stopping all backends except 'wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin'
8:02AM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
To Reproduce
Here is my model YAML:
backend: llama-stable
context_size: 4096
batch: 512
threads: 1
f16: true
gpu_layers: 43
mmlock: true
name: wizard13B_gpu
parameters:
  model: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
  temperature: 0.2
  top_p: 0.9
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
stopwords:
  - "USER:"
  - "</s>"
template:
  chat: wizard_chat
  completion: wizard_completion
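
For reference, here is a minimal sketch of the kind of client loop that reproduces this, assuming LocalAI's OpenAI-compatible /v1/chat/completions endpoint on the default port; the base URL, prompt, timeout, and request count are placeholders, not the exact values from my client:

import requests

BASE_URL = "http://localhost:8080"   # placeholder; adjust for your Service/ingress

PAYLOAD = {
    "model": "wizard13B_gpu",        # matches the `name:` field in the YAML above
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this sentence in five words."},
    ],
    "temperature": 0.2,
    "top_p": 0.9,
}

for i in range(10_000):
    try:
        # Strictly serial: each call blocks until it returns or the
        # client-side timeout fires, so only one request is ever in flight.
        r = requests.post(f"{BASE_URL}/v1/chat/completions", json=PAYLOAD, timeout=120)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        # Once one request hangs, the GPU stays pinned and every
        # subsequent request times out as well.
        print(f"request {i} timed out")
        break

The loop is strictly serial, so there is never more than one request in flight when the hang occurs.
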
Expected behavior
I would expect it to keep operating as it does for the first few hundred requests.
Logs
8:02AM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
8:02AM DBG Stopping all backends except 'wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin'
8:02AM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q5_K_M.bin
Additional context