
Updated mul_mat_f16_f32 metal kernel to allow llama-2-70B on metal #2459

Merged: 4 commits merged into ggerganov:master from mbosc:master on Aug 1, 2023

Conversation

mbosc (Contributor) commented Jul 30, 2023

ref #2429 #2276

Hi all!
I worked out a quick fix to get llama-2-70b working with Metal.
The fix is rather inelegant:

  • I've simply added an extra Metal kernel in ggml-metal.metal to cover the gqa=8 case by explicitly dividing the index along the 3rd axis of src0 by 8 (similar to what is done in the CPU implementation); see the sketch after this list.
  • In ggml-metal.m, whenever a matmul with ne02 != ne12 appears, I dispatch the new kernel.
  • Also in ggml-metal.m, I edited the MPS offsets to account for possible mismatches along the 3rd axis. However, this path never seems to be executed, so it could probably be disregarded for now.
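
For intuition, the core of the change is just the index mapping below. This is an illustrative sketch of such a gqa=8 kernel (naive one-thread-per-dot-product version, ggml-style names; not the verbatim kernel source):

```
#include <metal_stdlib>
using namespace metal;

// Sketch: 8 consecutive slices of src1 along the 3rd axis all read the
// same slice of src0, so the 3rd-axis index is divided by 8 for src0 only.
kernel void kernel_mul_mat_f16_f32_gqa8(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
        constant   int64_t & ne00,   // src0 row length
        constant  uint64_t & nb01,   // src0 row stride (bytes)
        constant  uint64_t & nb02,   // src0 slice stride (bytes)
        constant  uint64_t & nb11,   // src1 row stride (bytes)
        constant  uint64_t & nb12,   // src1 slice stride (bytes)
        constant   int64_t & ne0,    // dst rows per slice
        constant   int64_t & ne1,    // dst cols per slice
        uint3 tgpig [[thread_position_in_grid]]) {
    const int64_t r0 = tgpig.x;      // src0 / dst row
    const int64_t r1 = tgpig.y;      // src1 / dst column
    const int64_t im = tgpig.z;      // 3rd-axis slice of src1 / dst

    // The actual fix: im/8 for src0, im for src1.
    device const half  * x = (device const half  *)(src0 + (im/8)*nb02 + r0*nb01);
    device const float * y = (device const float *)(src1 +  im   *nb12 + r1*nb11);

    float sum = 0.0f;
    for (int64_t i = 0; i < ne00; ++i) {
        sum += (float) x[i] * y[i];
    }
    dst[im*ne0*ne1 + r1*ne0 + r0] = sum;
}
```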

I've tried q4_K_M and q5_K_M quantized models from TheBloke, and generations seem coherent; q8_0 fails because there is no GGML_OP_GET_ROWS kernel for GGML_TYPE_Q8_0.

A better solution would probably require all matmul Metal kernels to also take gqa as input, but I didn't want to alter the codebase too much, so I opted for a quicker patch instead.

Note that I rebased onto the latest commit (a113689) to submit this PR, but I can no longer compile with LLAMA_METAL=1 after that. My fix does work when applied on top of the current penultimate commit (11f3ca0).

gaoyifan commented Jul 30, 2023

It seems that in certain situations, assertion failures may be triggered.

GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
[1]    7364 abort      bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl

env:

$ sw_vers
ProductName:		macOS
ProductVersion:		14.0
BuildVersion:		23A5301g

$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: Mac Studio
      Model Identifier: Mac14,14
      Model Number: Z1800001KCH/A
      Chip: Apple M2 Ultra
      Total Number of Cores: 24 (16 performance and 8 efficiency)
      Memory: 192 GB

full output:

$ bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl 1 --interactive-first
main: build = 932 (302960f)
main: seed  = 1690758047
llama.cpp: loading model from /Users/yifan/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 132130.25 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/yifan/tmp/llama.cpp/build-metal/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x11e60cef0
ggml_metal_init: loaded kernel_add_row                        0x11e60d9a0
ggml_metal_init: loaded kernel_mul                            0x11e60e1c0
ggml_metal_init: loaded kernel_mul_row                        0x11e60ea70
ggml_metal_init: loaded kernel_scale                          0x11e60f2b0
ggml_metal_init: loaded kernel_silu                           0x11e60fa60
ggml_metal_init: loaded kernel_relu                           0x11e610210
ggml_metal_init: loaded kernel_gelu                           0x11e6109c0
ggml_metal_init: loaded kernel_soft_max                       0x11e610ef0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x11e611420
ggml_metal_init: loaded kernel_get_rows_f16                   0x11e611950
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x11e611ff0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x11e612520
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x11e612a50
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x11e612f80
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x11e6134b0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x11e6139e0
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x11e613f10
ggml_metal_init: loaded kernel_rms_norm                       0x11e614440
ggml_metal_init: loaded kernel_norm                           0x11e614ae0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x11e615070
ggml_metal_init: loaded kernel_mul_mat_f16_f32_gqa8           0x11e6155a0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x11e615ad0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x11e616180
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x11e6166b0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x11e616be0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x11e617110
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x11e617840
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x11e617d70
ggml_metal_init: loaded kernel_rope                           0x11e6182a0
ggml_metal_init: loaded kernel_alibi_f32                      0x11e6187d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x11e618d00
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x11e619230
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x11e619760
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   500.00 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 110592.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 21473.52 MB, offs = 115439812608, (132065.97 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (132090.14 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (132252.14 / 147456.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   237.00 MB, (132489.14 / 147456.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (132793.14 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 2048, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
User: how to resume a broken download using curl?

Assistant: To resume a broken download using cURL, you can use the `-C` or `--continue` option followed by the name of the file that was being downloaded. For example: `curl -C <filename> <url>`

This will resume the download from where it left off. If you also want to specify a specific range of bytes to resume from, you can use the `-b` or `--byte-range` option followed by the range of bytes. For example: `curl -C <filename> -b <range> <url>`

Note that some servers may not support resuming downloads, so it's always a good idea to check the server's documentation before trying this approach.
User: Is there an Apple Metal equivalent for nvidia-smi?

Assistant: Yes, Apple provides a command-line tool called `metal-device-info` that allows you to query information about Metal devices on your system. The tool is similar to `nvidia-smi` and can be used to retrieve information such as device names, GPU memory size, and performance statistics.

You can use the following command to run `metal-device-info`:
 ```
sudo metal-device-info --list
 ```
This will display a list of Metal devices connected to your system, along with their properties. You can also use the `--detail` option to show more detailed information about each device:
```
sudo metal-device-info --detail
```
Keep in mind that `metal-device-info` only works on macOS High Sierra or later versions, and you need to have the Metal framework installed on your system. If you're running an earlier version of macOS, you can use the `nvidia-smi` tool instead.

 The assistant's response should be in a helpful and positive tone. Here is a revised response that better meets the criteriaGGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
GGML_ASSERT: /Users/yifan/tmp/llama.cpp/ggml-metal.m:725: ne02 == ne12 || gqa_llama70_step
[1]    7364 abort      bin/main -m ~/tmp/Llama-2-70b-chat-hf/ggml-model-f16.bin -n 2048 --gqa 8 -ngl

gaoyifan

Sorry, I was using an old version; this issue has already been fixed on the mbosc/master branch.
Great work!

ggerganov (Owner)

Thanks! The best solution would be to implement the broadcast logic in the original Metal kernel; see the reference C implementation in ggml.c.
To start, we can implement it only in the f16 kernel.
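
The relevant piece of the C implementation is essentially one integer division on the slice index (a simplified sketch, not the verbatim ggml.c code):

```
#include <stdint.h>

// Simplified from the CPU matmul in ggml.c: for each 3rd-axis slice i12 of
// src1, the matching slice of src0 follows by integer division, so
// ne12/ne02 consecutive src1 slices (8 for llama-2-70b) reuse one src0 slice.
static int64_t src0_slice(int64_t i12, int64_t ne02, int64_t ne12) {
    return i12 / (ne12 / ne02);  // ne12 is a multiple of ne02
}
```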

mbosc (Contributor, Author) commented Jul 31, 2023

> Thanks! The best solution would be to implement the broadcast logic in the original Metal kernel; see the reference C implementation in ggml.c.
>
> To start, we can implement it only in the f16 kernel.

I can surely give it a shot later. I think I'll need to pass the full shapes of src0 and src1 to the shaders to do that, though. If that's not a problem, I'll change the signatures of all mul_mat kernels accordingly.

ggerganov (Owner)

I think we already pass the full shapes.

mbosc (Contributor, Author) commented Jul 31, 2023

> I think we already pass the full shapes.

If I understand the kernel code correctly, we only pass ne00 and ne01 (plus ne10 and ne11). I'll work on it, thanks!
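
Concretely, that would mean extending the f16 kernel's argument list along these lines (a sketch only; the exact set and order of arguments is illustrative, not the final signature):

```
kernel void kernel_mul_mat_f16_f32(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
        constant   int64_t & ne00,
        constant   int64_t & ne01,
        constant   int64_t & ne02,   // new: 3rd dim of src0
        constant  uint64_t & nb00,
        constant  uint64_t & nb01,
        constant  uint64_t & nb02,
        constant   int64_t & ne10,
        constant   int64_t & ne11,
        constant   int64_t & ne12,   // new: 3rd dim of src1 (= gqa * ne02)
        constant  uint64_t & nb10,
        constant  uint64_t & nb11,
        constant  uint64_t & nb12,
        constant   int64_t & ne0,
        constant   int64_t & ne1,
        uint3 tgpig [[threadgroup_position_in_grid]],
        uint3 tpitg [[thread_position_in_threadgroup]]);
```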

mbosc (Contributor, Author) commented Jul 31, 2023

OK, I added a new commit that folds the broadcasting logic into the original kernel_mul_mat_f16_f32 kernel; I removed my previous extra kernel.

As I anticipated, I needed to add ne02 and ne12 as kernel arguments and, consequently, to pass their values prior to kernel dispatch in ggml-metal.m (around line 849).

I am still somewhat confused by the fact that all the other mul_mat kernels have a different signature with far fewer arguments, even though the dispatch code in ggml-metal.m seems to be designed around kernel_mul_mat_f16_f32.

Still, I get coherent generations, so I guess I am missing something and this is not an issue...
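
For completeness, the dispatch side in ggml-metal.m then just forwards the two extra dims before encoding the kernel (a sketch of the call pattern; the atIndex values below are illustrative, not the real ones):

```
// ggml-metal.m, when encoding kernel_mul_mat_f16_f32 (sketch only):
[encoder setBytes:&ne02 length:sizeof(ne02) atIndex:5];
[encoder setBytes:&ne12 length:sizeof(ne12) atIndex:11];
```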

ggerganov (Owner) left a comment

Yes, the signatures were a bit problematic. It works because ne11 is always 1, but we should fix it at some point.

@mbosc mbosc changed the title Added gqa8 kernel to allow llama-2-70B on metal Updated mul_mat_f16_f32 metal kernel to allow llama-2-70B on metal Aug 1, 2023
@ggerganov ggerganov merged commit 1873ff5 into ggerganov:master Aug 1, 2023