Description
I don't know whether it makes sense to enable both CLBlast (in a bid to speed up prompt ingestion) and Metal, but clearly something is wrong with this combination:
When llama.cpp is built with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON, the server prints roughly 140 lines of ggml_metal_get_buffer: error: buffer is nil.
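For reference, the build was roughly along these lines (a sketch of the assumed CMake invocation; the build directory and generator are not shown in the log):

```sh
# Problematic configuration: CLBlast explicitly enabled; Metal appears to be
# enabled by default on this Mac (see ggml_metal_init in the log below).
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```

The resulting server binary was then started as shown below.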
$ ./server -ngl 16 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
ggml_opencl: device FP16 support: false
{"timestamp":1698317012,"level":"INFO","function":"main","line":2213,"message":"build info","build":1429,"commit":"00ae2aa"}
{"timestamp":1698317012,"level":"INFO","function":"main","line":2220,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 543 tensors from /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf (version GGUF V2 (latest))
llama_model_loader: (...snip...)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 6656
llm_load_print_meta: n_head = 52
llm_load_print_meta: n_head_kv = 52
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 32.53 B
llm_load_print_meta: model size = 18.27 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 18711.64 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 13676.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/61 layers to GPU
llm_load_tensors: VRAM used: 5035.47 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: (...snip loaded...)
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 103.13 MB
llama_new_context_with_model: max tensor size = 166.63 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 18711.64 MB, (23750.41 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 780.02 MB, (24530.42 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 97.02 MB, (24627.44 / 49152.00)
ggml_metal_get_buffer: error: buffer is nil [repeated 140 times or so]
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
{"timestamp":1698317017,"level":"INFO","function":"main","line":2495,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
And the conversation (using all of the server's defaults) is pretty wonky:
User: Hello, Llama! How are you?
Llama: Hi user friendships! I am doing fine thankfully yoursselfnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessness
(repeated forever).
While generating, the console is similarly full of ggml_metal_get_buffer: error: buffer is nil.
After rebuilding with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON and running the same command line, there are no ggml_metal_get_buffer: error: buffer is nil messages and the conversation is back to normal:
User: Hello, Llama! How are you?
Llama: I'm doing great, thank you for asking! And how about yourself?
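For comparison, a sketch of the working rebuild, which differs only in dropping the CLBlast flag:

```sh
# Working configuration: same flags except -DLLAMA_CLBLAST=ON is omitted;
# Metal is still active and the -ngl 16 offload behaves as expected.
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```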
The results seem to depend on the -ngl setting; without -ngl, the CLBlast build responds fine, but with e.g. -ngl 32 (commands sketched below), the output degenerates:
User: Hello, Llama! How are you?
Llama: Hiya! fine tuned ready readyreadyReadyReady readyyyahooahoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
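To be explicit, the only difference between the good and the bad runs is the -ngl value (model path and other flags as in the original command):

```sh
# CLBlast build, no -ngl: responses look fine.
./server --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
# CLBlast build with -ngl 32 (or 16, as in the log above): output degenerates as shown.
./server -ngl 32 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
```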
Environment and Context
- llama.cpp version: that of "Try cwd for ggml-metal.metal if bundle lookup fails" #3793, so tag b1428 + 1 commit
- Physical (or virtual) hardware you are using: MacBook Pro, Apple M2 Max
- Operating System: macOS Ventura 13.6
- SDK version: Apple clang version 15.0.0 (clang-1500.0.40.1)
- CLBlast version: 1.6.1 (stable, bottled, installed via Homebrew)