
metal: somewhat faster f16 x f32 matrix multiply kernel #2951


Merged: 2 commits into master from ik/metal_faster_mm_f16_f32 on Sep 1, 2023

Conversation

ikawrakow (Contributor) commented on Sep 1, 2023:

The speedup comes simply from better accumulation of thread results. The larger the context (prompt), the more improvement we see in pp timing.
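For illustration, here is a minimal Metal sketch of the idea. This is not the actual kernel_mul_mat_f16_f32 from ggml-metal.metal; the kernel name, buffer layout, and indexing are simplified assumptions. The point is that each thread keeps its partial f16 x f32 dot product in a register and the partials are combined once with a SIMD-group reduction, instead of being accumulated serially through threadgroup memory.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative sketch only -- a simplified stand-in for the real kernel.
// One threadgroup (assumed here to be a single 32-thread SIMD group) produces one
// output element: each thread accumulates a partial dot product in a register, and
// the partials are combined with a single simd_sum() reduction at the end.
kernel void mul_mat_f16_f32_sketch(
        device const half  * x     [[buffer(0)]],  // f16 matrix, row-major
        device const float * y     [[buffer(1)]],  // f32 vector
        device       float * dst   [[buffer(2)]],  // one f32 result per row
        constant     int   & ncols [[buffer(3)]],
        uint row [[threadgroup_position_in_grid]],
        uint tid [[thread_index_in_threadgroup]],
        uint nth [[threads_per_threadgroup]]) {

    device const half * xr = x + (size_t) row * (size_t) ncols;

    // per-thread partial sum, kept in a register rather than threadgroup memory
    float sumf = 0.0f;
    for (int i = tid; i < ncols; i += nth) {
        sumf += (float) xr[i] * y[i];
    }

    // combine all partials with one SIMD-group reduction (valid while nth <= 32)
    const float all_sum = simd_sum(sumf);
    if (tid == 0) {
        dst[row] = all_sum;
    }
}
```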

7B Q4_0 on a 30-core M2 Max:

| model | backend | test | t/s (Master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | Metal | pp 32 | 370.26 ± 1.59 | 374.78 ± 0.95 | 1.2% |
| LLaMA 7B mostly Q4_0 | Metal | pp 64 | 425.70 ± 0.62 | 433.17 ± 0.46 | 1.8% |
| LLaMA 7B mostly Q4_0 | Metal | pp 128 | 406.00 ± 0.62 | 419.64 ± 0.76 | 3.4% |
| LLaMA 7B mostly Q4_0 | Metal | pp 256 | 350.71 ± 0.15 | 373.11 ± 0.23 | 6.4% |
| LLaMA 7B mostly Q4_0 | Metal | pp 512 | 264.21 ± 0.42 | 290.76 ± 0.29 | 10.0% |

Update:

If we also change the number of thread groups from 64 to 32, it gets even better:

| model | backend | test | t/s (Master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | Metal | pp 32 | 370.26 ± 1.59 | 376.74 ± 2.03 | +1.7% |
| LLaMA 7B mostly Q4_0 | Metal | pp 64 | 425.70 ± 0.62 | 442.49 ± 0.98 | +3.9% |
| LLaMA 7B mostly Q4_0 | Metal | pp 128 | 406.00 ± 0.62 | 437.34 ± 0.66 | +7.7% |
| LLaMA 7B mostly Q4_0 | Metal | pp 256 | 350.71 ± 0.15 | 400.49 ± 0.50 | +14.2% |
| LLaMA 7B mostly Q4_0 | Metal | pp 512 | 264.21 ± 0.42 | 323.87 ± 0.10 | +22.6% |

It does give a small benefit for TG too. E.g., for tg 128 I get 61.28 ± 0.16 t/s vs 59.98 ± 0.10 t/s on master.

ikawrakow requested a review from ggerganov on September 1, 2023 at 07:18.
monatis (Collaborator) commented on Sep 1, 2023:

The speedup column is not very intuitive. It is usually reported as the ratio PR / Master, so for pp 512, for example, the speedup should be 1.1 instead of 10.0%.
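Taking the pp 512 row from the first table above, that convention works out to:

$$\mathrm{speedup} = \frac{290.76}{264.21} \approx 1.10$$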

ggerganov (Member) left a comment:


M2 Ultra

| model | size | params | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 32 | 686.11 ± 4.85 | 684.45 ± 3.89 | 1.00 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 64 | 852.72 ± 2.61 | 859.01 ± 1.66 | 1.01 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 128 | 910.65 ± 1.78 | 927.03 ± 2.15 | 1.02 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 256 | 811.85 ± 0.98 | 837.43 ± 0.99 | 1.03 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 512 | 632.02 ± 0.11 | 663.84 ± 0.14 | 1.05 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 1024 | 467.89 ± 0.12 | 498.51 ± 0.09 | 1.07 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | tg 128 | 87.20 ± 0.12 | 87.46 ± 0.03 | 1.01 |

I think this PR does not conflict with #2891, so we can merge it.

ggerganov (Member) commented:

Here is the speedup after the last commit:

| model | size | params | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 32 | 686.11 ± 4.85 | 689.51 ± 3.60 | 1.00 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 64 | 852.72 ± 2.61 | 862.13 ± 2.97 | 1.01 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 128 | 910.65 ± 1.78 | 945.49 ± 1.71 | 1.04 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 256 | 811.85 ± 0.98 | 873.72 ± 0.16 | 1.08 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 512 | 632.02 ± 0.11 | 711.37 ± 0.26 | 1.13 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | pp 1024 | 467.89 ± 0.12 | 543.60 ± 0.10 | 1.16 |
| LLaMA 7B Q4_0 | 3.56 GiB | 6.74 B | 1 | tg 128 | 87.20 ± 0.12 | 87.38 ± 0.08 | 1.00 |

🦙

ikawrakow merged commit e8d9158 into master on Sep 1, 2023.
ikawrakow deleted the ik/metal_faster_mm_f16_f32 branch on September 1, 2023 at 08:16.