
Strange results when CLBlast and Metal are both enabled and ngl > 1 #3794

@akx

Description

I don't know whether it makes sense to enable both CLBlast (in a bid to speed up prompt ingestion) and Metal, but something is clearly wrong with this combination:

When llama.cpp is built with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON, the server spews out roughly 140 lines of ggml_metal_get_buffer: error: buffer is nil.
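For reference, a sketch of the build steps implied by those flags (build directory and generator assumed; note that Metal is evidently compiled in by default on Apple silicon, since it initializes in the log below without being requested):

```sh
# Assumed cmake workflow; the -D flags are the ones quoted above.
cmake -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_CLBLAST=ON \
  -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```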

$ ./server -ngl 16 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
ggml_opencl: device FP16 support: false
{"timestamp":1698317012,"level":"INFO","function":"main","line":2213,"message":"build info","build":1429,"commit":"00ae2aa"}
{"timestamp":1698317012,"level":"INFO","function":"main","line":2220,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 543 tensors from /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf (version GGUF V2 (latest))
llama_model_loader: (...snip...)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 6656
llm_load_print_meta: n_head           = 52
llm_load_print_meta: n_head_kv        = 52
llm_load_print_meta: n_layer          = 60
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 17920
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 30B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 32.53 B
llm_load_print_meta: model size       = 18.27 GiB (4.83 BPW)
llm_load_print_meta: general.name   = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 18711.64 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required  = 13676.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/61 layers to GPU
llm_load_tensors: VRAM used: 5035.47 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  780.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: (...snip loaded...)
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 103.13 MB
llama_new_context_with_model: max tensor size =   166.63 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 18711.64 MB, (23750.41 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   780.02 MB, (24530.42 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =    97.02 MB, (24627.44 / 49152.00)
ggml_metal_get_buffer: error: buffer is nil [repeated 140 times or so]
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

{"timestamp":1698317017,"level":"INFO","function":"main","line":2495,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache

and the conversation (using all of the server's defaults) is pretty wonky:

User: Hello, Llama! How are you?

Llama: Hi user friendships! I am doing fine thankfully yoursselfnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessness

(repeated forever).
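The same behavior can presumably be reproduced without the web UI by posting to the server's /completion endpoint directly; the prompt and n_predict below are illustrative:

```sh
# Hypothetical direct request against the server started above.
curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "User: Hello, Llama! How are you?\nLlama:", "n_predict": 64}'
```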

While generating, the console is similarly full of ggml_metal_get_buffer: error: buffer is nil.
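To put a number on the spam, the server output can be captured and the errors counted (the model path is a placeholder):

```sh
# Capture server output to a file, then count the nil-buffer errors.
./server -ngl 16 --no-mmap -m model.gguf 2>&1 | tee server.log
# In another shell, after startup and a generation:
grep -c 'buffer is nil' server.log
```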

After rebuilding with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON (i.e. the same configuration minus -DLLAMA_CLBLAST=ON) and running the same command line, there are no ggml_metal_get_buffer: error: buffer is nil messages and the conversation is back to normal:

User: Hello, Llama! How are you?

Llama: I'm doing great, thank you for asking! And how about yourself?

The results seem to depend on the -ngl setting: without -ngl, the CLBlast build responds fine, but with e.g. -ngl 32:

User: Hello, Llama! How are you?

Llama: Hiya! fine tuned ready readyreadyReadyReady readyyyahooahoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
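A quick way to bisect which offload counts misbehave would be to sweep -ngl with the main example (model path and prompt are placeholders):

```sh
# Hypothetical sweep; compare output quality as more layers move to the GPU.
for n in 0 8 16 24 32; do
  echo "== -ngl $n =="
  ./main -m model.gguf -ngl "$n" --no-mmap -p "Hello, Llama! How are you?" -n 32
done
```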

Environment and Context

  • llama.cpp version: that of #3793 (Try cwd for ggml-metal.metal if bundle lookup fails), i.e. tag b1428 plus one commit
  • Physical (or virtual) hardware you are using: MacBook Pro, Apple M2 Max
  • Operating System: macOS Ventura 13.6
  • SDK version: Apple clang version 15.0.0 (clang-1500.0.40.1)
  • CLBlast version: 1.6.1 (stable, bottled) via Homebrew
