Description
I run several requests (3-4) at the same time; each request is executed by LLamaEmbedder.GetEmbeddings() and StatelessExecutor.InferAsync() in sequence.
The two calls use different models:
For inference (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
For embeddings (one instance for all users): Qwen2.5-1.5B-Q5-K-M
There is always enough VRAM for these requests, with a margin to spare.
1. One GPU
-- At first there were these CUDA errors:
CUDA error: operation failed due to a previous error during capture
CUDA error: operation not permitted when stream is capturing
ggml_cuda_compute_forward: ADD failed
-- The errors went away when I serialized the calls: I put a lock around GetEmbeddings() and around the context creation/destruction (CreateContext and its disposal) inside InferAsync(). A sketch is below.
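For reference, a minimal sketch of that workaround (the wrapper class and field names are mine, not part of LLamaSharp, and I assume a synchronous GetEmbeddings overload; for an async overload a SemaphoreSlim is needed instead, as sketched further below):

```csharp
using LLama;

// Hypothetical wrapper: serializes every embedding call on a single lock.
public sealed class LockedEmbedder
{
    // One gate shared by every caller that touches the GPU.
    private static readonly object GpuLock = new();
    private readonly LLamaEmbedder _embedder;

    public LockedEmbedder(LLamaEmbedder embedder) => _embedder = embedder;

    // Assumes a synchronous overload returning float[]; adjust to your version.
    public float[] GetEmbeddings(string text)
    {
        // Only one thread at a time may enter the native call.
        lock (GpuLock)
        {
            return _embedder.GetEmbeddings(text);
        }
    }
}
```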
Questions:
- Why did I have to do this? Is it the right approach?
- What are the general limitations of multithreading in LLamaSharp, and what needs to be taken into account here?
- Does anyone have experience implementing a multi-threaded web application on top of it?
2. Two GPUs
GPUSplitMode = GPUSplitMode.Layer;
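(For context, a sketch of how I load the model for this; the property name is from memory: in 0.21.0 the enum is LLama.Native.GPUSplitMode and, as far as I know, the corresponding ModelParams property is SplitMode.)

```csharp
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams(modelPath)
{
    GpuLayerCount = 99,             // offload all layers
    SplitMode = GPUSplitMode.Layer, // split layers across both GPUs
};
```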
Despite the fixes that work on one GPU, errors still occur on two GPUs:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
Questions:
- What should I do here? What should I pay attention to?
If each subsequent request is sent 2-3 seconds after the previous one, everything works!
After many hours of experimentation, my conclusion is that creating and destroying a context (where VRAM is allocated) must be done in a thread-safe way (inside a lock).
The same may be needed in other places where the GPU is used.
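Since C# does not allow await inside a lock, the async inference path needs a SemaphoreSlim instead. A minimal sketch of what I mean (the helper name is mine; holding the gate for the whole call is deliberately conservative, since StatelessExecutor creates and disposes its context internally per call):

```csharp
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Hypothetical gate serializing the whole inference call, including the
// context creation/destruction that StatelessExecutor performs internally.
public static class InferenceGate
{
    private static readonly SemaphoreSlim Gate = new(1, 1);

    public static async Task<string> RunAsync(
        StatelessExecutor executor, string prompt, InferenceParams inferenceParams)
    {
        await Gate.WaitAsync();
        try
        {
            var sb = new StringBuilder();
            await foreach (var token in executor.InferAsync(prompt, inferenceParams))
                sb.Append(token);
            return sb.ToString();
        }
        finally
        {
            Gate.Release();
        }
    }
}
```

This trades throughput for stability, which matches the observation above that spacing requests 2-3 seconds apart avoids the errors.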
Thanks.
Reproduction Steps
Send multiple parallel requests (3-4 concurrent calls into GetEmbeddings()/InferAsync()). A minimal sketch is below.
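(Model path and prompts are placeholders:)

```csharp
using System.Linq;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

var parameters = new ModelParams(@"models\Qwen2.5-14B-1M-Q5-K-M.gguf")
{
    GpuLayerCount = 99, // fully offload to GPU
};
using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);

// 3-4 concurrent requests with no locking; on my setup this triggers the
// "operation not permitted when stream is capturing" CUDA errors.
var tasks = Enumerable.Range(0, 4).Select(async i =>
{
    await foreach (var _ in executor.InferAsync(
        $"prompt {i}", new InferenceParams { MaxTokens = 64 }))
    {
        // drain the stream
    }
});
await Task.WhenAll(tasks);
```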
Environment & Configuration
- Operating system: Windows Server 2019
- .NET runtime version: 9.0.1
- LLamaSharp version: 0.21.0
- CUDA version (if you are using cuda backend): 12.8
- CPU & GPU device: 2 x RTX 4090 24 GB
Known Workarounds
Serializing GPU work behind a lock (around GetEmbeddings() and the context creation/destruction in InferAsync()) fixes the single-GPU case. Spacing requests 2-3 seconds apart avoids the errors on two GPUs.