
Updated mul_mat_f16_f32 metal kernel to allow llama-2-70B on metal #2459

Merged: 4 commits merged into ggerganov:master from mbosc:master on Aug 1, 2023

Conversation

mbosc (Contributor) commented Jul 30, 2023

ref #2429 #2276

Hi all!
I worked out a quick fix to get llama-2-70b working with Metal.
The fix is rather inelegant:

  • I've simply added an extra Metal kernel in ggml-metal.metal to cover the gqa=8 case by explicitly dividing the index along the 3rd axis of src0 by 8 (similar to what is done in the CPU implementation); see the sketch after this list.
  • In ggml-metal.m, whenever a matmul with ne02 != ne12 appears, I dispatch the new kernel.
  • Also in ggml-metal.m, I edited the MPS offsets to account for possible mismatches along the 3rd axis. However, this path never seems to be executed, so it could probably be disregarded for now.
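
For intuition, the core of the change is just the index mapping below. This is an illustrative sketch of such a gqa=8 kernel (naive one-thread-per-dot-product version, ggml-style names; not the verbatim kernel source):

```
#include <metal_stdlib>
using namespace metal;

// Sketch: 8 consecutive slices of src1 along the 3rd axis all read the
// same slice of src0, so the 3rd-axis index is divided by 8 for src0 only.
kernel void kernel_mul_mat_f16_f32_gqa8(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
        constant   int64_t & ne00,   // src0 row length
        constant  uint64_t & nb01,   // src0 row stride (bytes)
        constant  uint64_t & nb02,   // src0 slice stride (bytes)
        constant  uint64_t & nb11,   // src1 row stride (bytes)
        constant  uint64_t & nb12,   // src1 slice stride (bytes)
        constant   int64_t & ne0,    // dst rows per slice
        constant   int64_t & ne1,    // dst cols per slice
        uint3 tgpig [[thread_position_in_grid]]) {
    const int64_t r0 = tgpig.x;      // src0 / dst row
    const int64_t r1 = tgpig.y;      // src1 / dst column
    const int64_t im = tgpig.z;      // 3rd-axis slice of src1 / dst

    // The actual fix: im/8 for src0, im for src1.
    device const half  * x = (device const half  *)(src0 + (im/8)*nb02 + r0*nb01);
    device const float * y = (device const float *)(src1 +  im   *nb12 + r1*nb11);

    float sum = 0.0f;
    for (int64_t i = 0; i < ne00; ++i) {
        sum += (float) x[i] * y[i];
    }
    dst[im*ne0*ne1 + r1*ne0 + r0] = sum;
}
```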

I've tried q4_K_M and q5_K_M quantized models from TheBloke, and generations seem coherent; q8_0 fails because there is no GGML_OP_GET_ROWS kernel for GGML_TYPE_Q8_0.

A better solution would probably require all matmul Metal kernels to also take gqa as input, but I didn't want to alter the codebase too much, so I opted for a quicker patch instead.

Note that I rebased onto the latest commit (a113689) to submit this PR, but I can no longer compile with LLAMA_METAL=1 after that. My fix does work when applied on top of the current penultimate commit (11f3ca0).

gaoyifan commented Jul 30, 2023

It seems that in certain situations, assertion failures may be triggered.

GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
[1]    7364 abort      bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl

env:

$ sw_vers
ProductName:		macOS
ProductVersion:		14.0
BuildVersion:		23A5301g

$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: Mac Studio
      Model Identifier: Mac14,14
      Model Number: Z1800001KCH/A
      Chip: Apple M2 Ultra
      Total Number of Cores: 24 (16 performance and 8 efficiency)
      Memory: 192 GB

full output:

$ bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl 1 --interactive-first
main: build = 932 (302960f)
main: seed  = 1690758047
llama.cpp: loading model from /Users/yifan/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 132130.25 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/yifan/tmp/llama.cpp/build-metal/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x11e60cef0
ggml_metal_init: loaded kernel_add_row                        0x11e60d9a0
ggml_metal_init: loaded kernel_mul                            0x11e60e1c0
ggml_metal_init: loaded kernel_mul_row                        0x11e60ea70
ggml_metal_init: loaded kernel_scale                          0x11e60f2b0
ggml_metal_init: loaded kernel_silu                           0x11e60fa60
ggml_metal_init: loaded kernel_relu                           0x11e610210
ggml_metal_init: loaded kernel_gelu                           0x11e6109c0
ggml_metal_init: loaded kernel_soft_max                       0x11e610ef0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x11e611420
ggml_metal_init: loaded kernel_get_rows_f16                   0x11e611950
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x11e611ff0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x11e612520
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x11e612a50
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x11e612f80
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x11e6134b0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x11e6139e0
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x11e613f10
ggml_metal_init: loaded kernel_rms_norm                       0x11e614440
ggml_metal_init: loaded kernel_norm                           0x11e614ae0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x11e615070
ggml_metal_init: loaded kernel_mul_mat_f16_f32_gqa8           0x11e6155a0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x11e615ad0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x11e616180
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x11e6166b0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x11e616be0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x11e617110
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x11e617840
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x11e617d70
ggml_metal_init: loaded kernel_rope                           0x11e6182a0
ggml_metal_init: loaded kernel_alibi_f32                      0x11e6187d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x11e618d00
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x11e619230
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x11e619760
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   500.00 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 110592.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 21473.52 MB, offs = 115439812608, (132065.97 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (132090.14 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (132252.14 / 147456.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   237.00 MB, (132489.14 / 147456.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (132793.14 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 2048, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
User: how to resume a broken download using curl?

Assistant: To resume a broken download using cURL, you can use the `-C` or `--continue` option followed by the name of the file that was being downloaded. For example: `curl -C <filename> <url>`

This will resume the download from where it left off. If you also want to specify a specific range of bytes to resume from, you can use the `-b` or `--byte-range` option followed by the range of bytes. For example: `curl -C <filename> -b <range> <url>`

Note that some servers may not support resuming downloads, so it's always a good idea to check the server's documentation before trying this approach.
User: Is there an Apple Metal equivalent for nvidia-smi?

Assistant: Yes, Apple provides a command-line tool called `metal-device-info` that allows you to query information about Metal devices on your system. The tool is similar to `nvidia-smi` and can be used to retrieve information such as device names, GPU memory size, and performance statistics.

You can use the following command to run `metal-device-info`:
 ```
sudo metal-device-info --list
 ```
This will display a list of Metal devices connected to your system, along with their properties. You can also use the `--detail` option to show more detailed information about each device:
```
sudo metal-device-info --detail
```
Keep in mind that `metal-device-info` only works on macOS High Sierra or later versions, and you need to have the Metal framework installed on your system. If you're running an earlier version of macOS, you can use the `nvidia-smi` tool instead.

 The assistant's response should be in a helpful and positive tone. Here is a revised response that better meets the criteriaGGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
[1]    7364 abort      bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl

gaoyifan

Sorry, I was using an old version; this issue has already been fixed on the mbosc/master branch.
Great work!

ggerganov (Owner)

Thanks! The best solution would be to implement the broadcast logic in the original Metal kernel; see the reference C implementation in ggml.c.
To start, we can implement it only in the f16 kernel.
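
The relevant piece of the C implementation is essentially one integer division on the slice index (a simplified sketch, not the verbatim ggml.c code):

```
#include <stdint.h>

// Simplified from the CPU matmul in ggml.c: for each 3rd-axis slice i12 of
// src1, the matching slice of src0 follows by integer division, so
// ne12/ne02 consecutive src1 slices (8 for llama-2-70b) reuse one src0 slice.
static int64_t src0_slice(int64_t i12, int64_t ne02, int64_t ne12) {
    return i12 / (ne12 / ne02);  // ne12 is a multiple of ne02
}
```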

mbosc (Contributor, Author) commented Jul 31, 2023

> Thanks! The best solution would be to implement the broadcast logic in the original Metal kernel; see the reference C implementation in ggml.c.
>
> To start, we can implement it only in the f16 kernel.

I can surely give it a shot later. I think I'll need to pass the full shapes of src0 and src1 to the shaders to do that, though. If that's not a problem, I'll change the signatures of all mul_mat kernels accordingly.

ggerganov (Owner)

I think we already pass the full shapes.

mbosc (Contributor, Author) commented Jul 31, 2023

> I think we already pass the full shapes.

If I understand the kernel code correctly, we only pass ne00 and ne01 (plus ne10 and ne11). I'll work on it, thanks!
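
Concretely, that would mean extending the f16 kernel's argument list along these lines (a sketch only; the exact set and order of arguments is illustrative, not the final signature):

```
kernel void kernel_mul_mat_f16_f32(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
        constant   int64_t & ne00,
        constant   int64_t & ne01,
        constant   int64_t & ne02,   // new: 3rd dim of src0
        constant  uint64_t & nb00,
        constant  uint64_t & nb01,
        constant  uint64_t & nb02,
        constant   int64_t & ne10,
        constant   int64_t & ne11,
        constant   int64_t & ne12,   // new: 3rd dim of src1 (= gqa * ne02)
        constant  uint64_t & nb10,
        constant  uint64_t & nb11,
        constant  uint64_t & nb12,
        constant   int64_t & ne0,
        constant   int64_t & ne1,
        uint3 tgpig [[threadgroup_position_in_grid]],
        uint3 tpitg [[thread_position_in_threadgroup]]);
```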

mbosc (Contributor, Author) commented Jul 31, 2023

OK, I added a new commit that folds the broadcasting logic into the original kernel_mul_mat_f16_f32 kernel; I removed my previous extra kernel.

As I anticipated, I needed to add ne02 and ne12 as kernel arguments and, consequently, to pass their values prior to kernel dispatch in ggml-metal.m (around line 849).

I am still somewhat confused by the fact that all the other mul_mat kernels have a different signature with far fewer arguments, even though the dispatch code in ggml-metal.m seems to be designed around kernel_mul_mat_f16_f32.

Still, I get coherent generations, so I guess I am missing something and this is not an issue...
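
For completeness, the dispatch side in ggml-metal.m then just forwards the two extra dims before encoding the kernel (a sketch of the call pattern; the atIndex values below are illustrative, not the real ones):

```
// ggml-metal.m, when encoding kernel_mul_mat_f16_f32 (sketch only):
[encoder setBytes:&ne02 length:sizeof(ne02) atIndex:5];
[encoder setBytes:&ne12 length:sizeof(ne12) atIndex:11];
```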

ggerganov (Owner) left a comment

Yes, the signatures were a bit problematic. It works because ne11 is always 1, but we should fix it at some point.

@mbosc mbosc changed the title Added gqa8 kernel to allow llama-2-70B on metal Updated mul_mat_f16_f32 metal kernel to allow llama-2-70B on metal Aug 1, 2023
@ggerganov ggerganov merged commit 1873ff5 into ggerganov:master Aug 1, 2023