Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4187 (be0e350c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
The new --device flag does not work with -sm row.
Devices:
$ ./llama-server --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 23892 MiB free)
CUDA1: Tesla P40 (24438 MiB, 24290 MiB free)
CUDA2: Tesla P40 (24438 MiB, 24290 MiB free)
CUDA3: Tesla P40 (24438 MiB, 24290 MiB free)
When running with this command:
./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0
The main model gets split across the P40s as expected, and the draft model runs on the 3090. However, when adding -sm row, the main model gets split across all 4 GPUs instead of just the P40s.
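For reference, the failing invocation is simply the command above with -sm row appended (paths and port are specific to my setup; the only change is the last flag):
./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0 \
-sm row
With this, main-model weights end up on CUDA0 as well, even though CUDA0 is not in the --device list.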
First Bad Commit
Likely introduced by #10497, which added the --device and --device-draft flags.
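In the meantime, a possible workaround that I have not verified: since -sm row distributes weight rows according to --tensor-split, giving CUDA0 a zero share might keep the main model off the 3090 even while --device is being ignored (how this interacts with --device-draft CUDA0 for the draft model is untested):
# untested: zero row share for CUDA0 so the main model stays on the P40s
./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0 \
-sm row -ts 0,1,1,1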
Relevant log output
No response