Research: Performance differences between Metal (macOS) and Vulkan (Linux) #10982

Open
@asahilina

Description

I'm one of the developers for the Asahi Linux GPU drivers, which provide accelerated Vulkan and OpenGL support on Apple Silicon platforms. I'm interested in improving the performance of llama.cpp on our drivers with the Vulkan backend.

As things stand today, macOS is significantly faster in a quick llama-bench run with default settings (tested on an M2 Max, 64GB):

Linux:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 Max (G14C B1) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
ggml_vulkan: Compiling shaders................................Done!
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         pp512 |         92.16 ± 0.08 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         21.93 ± 0.02 |

build: 9ba399dfa7f1 (4391)

macOS:

./build/bin/llama-bench -m /Volumes/Untitled/mistral-7b-v0.1.Q4_K_M.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS,RPC |       8 |         pp512 |        580.26 ± 8.82 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS,RPC |       8 |         tg128 |         61.18 ± 0.41 |

build: 9ba399df (4391)

(I also tested a larger 70B model, which failed to load on Linux due to a memory-allocation failure, but that's clearly a separate issue and easy to debug. It's probably just a hardcoded allocation-size limit in the driver that we can raise, since we recently refactored a bunch of things to handle >4G buffers properly.)

Of course, we'd like to improve the driver where possible to make things faster. However, since I know nothing about how LLMs are implemented under the hood, or about the state of the llama.cpp Metal and Vulkan backends, I'd like to ask for help pinning down the perf issues and analyzing whether llama.cpp itself could also be part of the root cause.

Would you be able to help us out? I'm curious about these things:

  • The state of the Metal vs. Vulkan backends, and whether any perf differences would be expected on the same hardware from that alone (are the shaders and the way the workload is run essentially identical, or are there major differences?).
  • How to get more information on how the work is scheduled on both Metal and Vulkan (block sizes, shared memory allocations, and the like), so we can identify any differences or choices at the llama.cpp level that could explain the gap.
  • How to run smaller micro-benchmarks. To work out driver and shader-compiler issues, we'd ideally want to narrow things down to single shaders / compute launches and measure their performance individually (see my stab at this after the list).
  • General info on what to expect and where we should dig deeper. Are things usually memory-bandwidth-bound (I understand that's the case for LLMs), or is it likely we'll run into ALU-bound shaders? Is there any heavy synchronization involved, or are we mostly dealing with large standalone compute launches? Is cache performance critical, and could differences in data layout or processing order matter? (Some back-of-the-envelope bandwidth numbers after the list.)
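On the micro-benchmark question: poking around the repo, `test-backend-ops` looks like it can time individual ops in isolation, which may be exactly what we need; please correct me if there's a better tool or if I'm misreading it. Assuming I have the invocation right, something like this should benchmark just the matrix multiplies on whatever backends are built in:

./build/bin/test-backend-ops perf -o MUL_MAT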
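And for the bandwidth question, here's my back-of-the-envelope math (assuming the commonly cited ~400 GB/s memory bandwidth for the M2 Max): token generation has to stream the full 4.07 GiB (~4.4 GB) of weights once per token, so the bandwidth-bound ceiling would be roughly 400 / 4.4 ≈ 90 t/s. Metal's 61 t/s is in that ballpark, while our 22 t/s falls well short. Meanwhile the pp512 gap (580 vs. 92, about 6.3x) is even larger than the tg128 gap (about 2.8x), which makes me suspect we're also losing time on the compute-heavy side, not just on bandwidth.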
