Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 ROCm devices:
Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 4: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 5: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 6: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 7: AMD Instinct MI100, compute capability 9.0, VMM: no
version: 4436 (53ff6b9)
built with Ubuntu clang version 12.0.1-19ubuntu3 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
AMD Instinct MI100
Models
DeepSeek-V2
DeepSeek-V3
Problem description & steps to reproduce
Description
When running DeepSeek models (V2 or V3) with the ROCm backend, the model loads into VRAM successfully but never generates any output. One GPU becomes pinned at 100% utilization while the others remain idle.
Commands Used
DeepSeek V2
./llama-cli -m /models/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -ngl 999 --prompt '<|User|>why is the sky blue?<|Assistant|>'
DeepSeek V3
./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 --prompt '<|User|>why is the sky blue?<|Assistant|>'
Observed Behavior
- Model loads successfully and distributes across available GPUs
- After loading, one GPU gets stuck at 100% utilization
- No text generation occurs
- Other GPUs remain idle with only VRAM usage showing
Steps to Reproduce
- Load DeepSeek model (V2 or V3) using llama.cpp with ROCm backend
- Set appropriate number of layers for GPU offload (-ngl parameter)
- For V2: use -ngl 999 to offload all layers (split automatically across the GPUs)
- For V3: use -ngl 48 to offload 48 of the 62 layers (the remainder stays on CPU)
- Attempt text generation with any prompt using the commands shown above (see the repro sketch after this list)
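A minimal end-to-end repro sketch (the rocm-smi polling in a second terminal is just standard ROCm tooling, not a llama.cpp option; the 1-second interval is arbitrary):

# terminal 1: load the V3 model and attempt generation (same command as above)
./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 --prompt '<|User|>why is the sky blue?<|Assistant|>'

# terminal 2: watch GPU utilization; one MI100 pins at 100% GPU% while the others stay at 0%
watch -n 1 rocm-smi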
Additional Notes
- Both models exhibit similar behavior despite different quantization methods
- Model loading and VRAM distribution appear normal
- Issue occurs consistently across multiple attempts
- The same behavior happens when running deepseek-v2:16b-lite-chat-q4_K_M in Ollama (a sketch of that invocation follows this list)
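The Ollama check was roughly the following (the prompt text here is illustrative; only the model tag matters):

ollama run deepseek-v2:16b-lite-chat-q4_K_M "why is the sky blue?"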
First Bad Commit
No response
Relevant log output
root@dd8e6159288b:/app/build/bin# ./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 --prompt '<|User|>why is the sky blue?<|Assistant|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 ROCm devices:
Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 4: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 5: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 6: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 7: AMD Instinct MI100, compute capability 9.0, VMM: no
build: 4436 (53ff6b9b) with Ubuntu clang version 12.0.1-19ubuntu3 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device ROCm0 (AMD Instinct MI100) - 32180 MiB free
llama_model_load_from_file: using device ROCm1 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm2 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm3 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm4 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm5 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm6 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm7 (AMD Instinct MI100) - 32714 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from /deepseek-v3/deepseek-v3-unsloght/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 BF16
llama_model_loader: - kv 3: general.size_label str = 256x20B
llama_model_loader: - kv 4: deepseek2.block_count u32 = 61
llama_model_loader: - kv 5: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 6: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 7: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 8: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 9: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 10: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 13: general.file_type u32 = 10
llama_model_loader: - kv 14: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 15: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 16: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 17: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 18: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 19: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 20: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 21: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 22: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 23: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 24: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 25: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 26: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 28: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 29: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 30: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 40: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 41: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: split.no u16 = 0
llama_model_loader: - kv 44: split.count u16 = 5
llama_model_loader: - kv 45: split.tensors.count i32 = 1025
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 482 tensors
llama_model_loader: - type q3_K: 180 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 671.03 B
llm_load_print_meta: model size = 227.47 GiB (2.91 BPW)
llm_load_print_meta: general.name = DeepSeek V3 BF16
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloaded 48/62 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 41684.45 MiB
llm_load_tensors: ROCm0 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm1 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm2 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm3 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm4 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm5 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm6 model buffer size = 23905.15 MiB
llm_load_tensors: ROCm7 model buffer size = 23905.15 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 4160.00 MiB
llama_kv_cache_init: ROCm0 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm2 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm3 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm4 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm5 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm6 KV buffer size = 1920.00 MiB
llama_kv_cache_init: ROCm7 KV buffer size = 1920.00 MiB
llama_new_context_with_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 2790.00 MiB
llama_new_context_with_model: ROCm1 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm2 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm3 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm4 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm5 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm6 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm7 compute buffer size = 1186.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 88.01 MiB
llama_new_context_with_model: graph nodes = 5025
llama_new_context_with_model: graph splits = 243 (with bs=512), 10 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 20
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 4089827234
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
why is the sky blue?
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
          (DID,     GUID)   (Edge)  (Avg)   (Mem, Compute, ID)
====================================================================================================================
0 1 0x738c, 16733 54.0°C 97.0W N/A, N/A, 0 1502Mhz 1200Mhz 0% auto 290.0W 89% 100%
1 2 0x738c, 57681 41.0°C 35.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%
2 3 0x738c, 33109 42.0°C 39.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%
3 4 0x738c, 8559 42.0°C 39.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%
4 5 0x738c, 57703 41.0°C 34.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%
5 6 0x738c, 33123 39.0°C 34.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%
6 7 0x738c, 57724 41.0°C 39.0W N/A, N/A, 0 300Mhz 1200Mhz 0% auto 290.0W 84% 0%