Graph build hangs during model warmup #2413

Closed
@ProjectAtlantis-dev

Description

Building with LLAMA_METAL=1 and running make now produces a ./main executable that hangs in the call to llama_eval() (main.cpp line 412) with the CPU pegged at 100%. More specifically, it appears to hang in ggml_metal_graph_find_concurrency() in llama.cpp.

Building without LLAMA_METAL instead hangs in ggml_graph_compute_helper() in llama.cpp.
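
For reference, the hang can be reduced to the first graph evaluation. Below is a minimal sketch against the llama.cpp C API of this build; the model path and thread count are taken from my run, and it illustrates the failing call path rather than being a tested standalone repro:

```cpp
// Sketch of the code path that hangs, written against the llama.cpp C API
// of this era (build 916). The model path and thread count match the run
// logged below; this illustrates the hang, it is not a tested repro program.
#include "llama.h"

int main() {
    llama_backend_init(false /* numa */);

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    llama_model   * model = llama_load_model_from_file("./models/ggml-model-q4_k.bin", params);
    llama_context * ctx   = llama_new_context_with_model(model, params);

    // First graph evaluation (the warmup): with LLAMA_METAL=1 this never
    // returns from ggml_metal_graph_find_concurrency(); without Metal it
    // spins in ggml_graph_compute_helper().
    llama_token bos = llama_token_bos();
    llama_eval(ctx, &bos, 1, /*n_past=*/0, /*n_threads=*/8);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```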

I re-ran convert (f32) and q4_k quantization using the latest codebase in case the model format had changed, but that didn't help. I also tried q4_0 to rule out the k-quant code as the culprit, but that didn't help either.

However, rolling back to commit d2a4366 seems to fix the issue in both the LLAMA_METAL and non-Metal cases.

I am running on a MacBook with 64 GB of RAM.

```
main: build = 916 (b5472ea)
main: seed = 1690434757
llama.cpp: loading model from ./models/ggml-model-q4_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 38529.71 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 5120.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x149e0a7b0
ggml_metal_init: loaded kernel_add_row 0x149e0adb0
ggml_metal_init: loaded kernel_mul 0x149e0b2d0
ggml_metal_init: loaded kernel_mul_row 0x149e0b900
ggml_metal_init: loaded kernel_scale 0x149e0be20
ggml_metal_init: loaded kernel_silu 0x149e0c340
ggml_metal_init: loaded kernel_relu 0x149e0c860
ggml_metal_init: loaded kernel_gelu 0x149e0cd80
ggml_metal_init: loaded kernel_soft_max 0x149e0d430
ggml_metal_init: loaded kernel_diag_mask_inf 0x149e0da90
ggml_metal_init: loaded kernel_get_rows_f16 0x149e0e110
ggml_metal_init: loaded kernel_get_rows_q4_0 0x149e0e900
ggml_metal_init: loaded kernel_get_rows_q4_1 0x149e0ef80
ggml_metal_init: loaded kernel_get_rows_q2_K 0x149e0f600
ggml_metal_init: loaded kernel_get_rows_q3_K 0x149e0fc80
ggml_metal_init: loaded kernel_get_rows_q4_K 0x149e10300
ggml_metal_init: loaded kernel_get_rows_q5_K 0x149e10980
ggml_metal_init: loaded kernel_get_rows_q6_K 0x149e11000
ggml_metal_init: loaded kernel_rms_norm 0x149e116c0
ggml_metal_init: loaded kernel_norm 0x149e11ee0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x149e12740
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x149e12e00
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x149e134c0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x149e13d00
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x149e143c0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x149e14a80
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x149e15120
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x149e159c0
ggml_metal_init: loaded kernel_rope 0x149e15ee0
ggml_metal_init: loaded kernel_alibi_f32 0x149e16a00
ggml_metal_init: loaded kernel_cpy_f32_f16 0x149e17290
ggml_metal_init: loaded kernel_cpy_f32_f32 0x149e17b20
ggml_metal_init: loaded kernel_cpy_f16_f16 0x149e18290
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 36864.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 866.05 MB, offs = 38439649280, (37730.50 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (37754.67 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 5122.00 MB, (42876.67 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 597.00 MB, (43473.67 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 384.00 MB, (43857.67 / 49152.00)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 16, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 50, tfs_z = 1.000000, top_p = 0.700000, typical_p = 1.000000, temp = 0.900000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0
```

Labels

bug
