Description
I don't know whether it makes sense to enable both CLBlast (in a bid to speed up prompt ingestion) and Metal, but clearly something is wrong with this combination:
When llama.cpp is built with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON, the server prints roughly 140 lines of ggml_metal_get_buffer: error: buffer is nil.
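For reference, the build was roughly along these lines (a sketch of the assumed CMake invocation; the build directory and generator are not shown in the log):

```sh
# Problematic configuration: CLBlast explicitly enabled; Metal appears to be
# enabled by default on this Mac (see ggml_metal_init in the log below).
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```

The resulting server binary was then started as shown below.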
$ ./server -ngl 16 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
ggml_opencl: device FP16 support: false
{"timestamp":1698317012,"level":"INFO","function":"main","line":2213,"message":"build info","build":1429,"commit":"00ae2aa"}
{"timestamp":1698317012,"level":"INFO","function":"main","line":2220,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 543 tensors from /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf (version GGUF V2 (latest))
llama_model_loader: (...snip...)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 6656
llm_load_print_meta: n_head = 52
llm_load_print_meta: n_head_kv = 52
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 32.53 B
llm_load_print_meta: model size = 18.27 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 18711.64 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 13676.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/61 layers to GPU
llm_load_tensors: VRAM used: 5035.47 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: (...snip loaded...)
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 103.13 MB
llama_new_context_with_model: max tensor size = 166.63 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 18711.64 MB, (23750.41 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 780.02 MB, (24530.42 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 97.02 MB, (24627.44 / 49152.00)
ggml_metal_get_buffer: error: buffer is nil [repeated 140 times or so]
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
{"timestamp":1698317017,"level":"INFO","function":"main","line":2495,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
And the conversation (using all of the server's defaults) is pretty wonky:
User: Hello, Llama! How are you?
Llama: Hi user friendships! I am doing fine thankfully yoursselfnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessness
(repeated forever).
While generating, the console is similarly full of ggml_metal_get_buffer: error: buffer is nil.
After rebuilding with -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON and running the same command line, there are no ggml_metal_get_buffer: error: buffer is nil messages and the conversation is back to normal:
User: Hello, Llama! How are you?
Llama: I'm doing great, thank you for asking! And how about yourself?
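For comparison, a sketch of the working rebuild, which differs only in dropping the CLBlast flag:

```sh
# Working configuration: same flags except -DLLAMA_CLBLAST=ON is omitted;
# Metal is still active and the -ngl 16 offload behaves as expected.
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```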
The results seem to depend on the -ngl setting; without -ngl, the CLBlast build responds fine, but with e.g. -ngl 32 (commands sketched below), the output degenerates:
User: Hello, Llama! How are you?
Llama: Hiya! fine tuned ready readyreadyReadyReady readyyyahooahoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
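To be explicit, the only difference between the good and the bad runs is the -ngl value (model path and other flags as in the original command):

```sh
# CLBlast build, no -ngl: responses look fine.
./server --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
# CLBlast build with -ngl 32 (or 16, as in the log above): output degenerates as shown.
./server -ngl 32 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
```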
Environment and Context
- llama.cpp version: that of "Try cwd for ggml-metal.metal if bundle lookup fails" #3793, so tag b1428 + 1 commit
- Physical (or virtual) hardware you are using: MacBook Pro, Apple M2 Max
- Operating System: macOS Ventura 13.6
- SDK version: Apple clang version 15.0.0 (clang-1500.0.40.1)
- CLBlast version: 1.6.1 (stable, bottled, installed via Homebrew)