metal : fix kernel_norm #3057

Merged
ggerganov merged 5 commits into master on Sep 7, 2023

Conversation

@ggerganov (Member) commented on Sep 7, 2023

After looking at the solution in #3060 I now understand where my mistake is: there is a missing barrier after computing the mean.
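For reference, here is a minimal sketch of what that kind of fix looks like (assuming the combined mean/variance loop; the exact hunk in this PR may differ slightly). The added barrier guarantees that every thread has read the reduced mean out of the shared sum buffer before the buffer is reset for the variance reduction:

--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ kernel void kernel_norm(
     const float mean = sum[0];
 
+    // ensure all threads have read sum[0] before the buffer is reused
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+
     // recenter and compute variance
     device float * y = dst + tgpig*ne00;
     sum[tpitg] = 0.0f;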

Ignore everything below


The norm kernel used by Falcon was broken in #2959

After fixing it, there still remains a very strange behaviour, which I have explained in the comments in the kernel. Basically, when the 2 loops for computing the mean and the variance are combined into one, inference breaks when that kernel runs with a threadgroup size of 256. If I tune the threadgroup size down to 128 or 32 threads, it works again.

I spent about an hour looking into this and I have no idea what is going on.
The only possible hint is that maybe we are somehow exceeding some threadgroup memory limit and Metal fails silently?


Seems unlikely, but I have no other guess. It would be great if someone else could also look into this; please let me know if you can confirm these observations:

First run the following command and note down the first few ppl numbers:

./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.1857,[2]6.3693,[3]6.6081,[4]7.5008,[5]7.5353,[6]7.4923,[7]7.6464,[8]7.8036,[9]8.1274,[10]8.2545

Now apply this patch and repeat:

--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -229,20 +229,9 @@ kernel void kernel_norm(
 
     // recenter
     device float * y = dst + tgpig*ne00;
-    for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
-        y[i00] = x[i00] - mean;
-    }
-
-    // VARIANCE
-    // parallel sum
-    //
-    // WARNING: combining this loop with the one above will give you wrong results for nth == 256
-    //          I have no idea why, so for now I am keeping them separate. But this behavior is very concerning.
-    //          Tested with:
-    //          ./perplexity -m ./falcon-7b/ggml-model-q4_0.gguf -f wiki.test.raw -ngl 1 -t 4
-    //
     sum[tpitg] = 0.0f;
     for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
+        y[i00] = x[i00] - mean;
         sum[tpitg] += y[i00] * y[i00];
     }
 
./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.2692,[2]6.4269,[3]6.6902,[4]7.5840,[5]7.6415,[6]7.6029,[7]7.7632,[8]7.9142,[9]8.2509,[10]8.3694
[1]5.2684,[2]6.4327,[3]6.6964,[4]7.5955,[5]7.6512,[6]7.6121,[7]7.7719,[8]7.9235,[9]8.2577,[10]8.3772

The numbers are not the same and differ from run to run.
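In hindsight (see the note at the top of this PR), the likely mechanism is not a memory limit but a data race on the threadgroup sum buffer: once the loops are combined, nothing stops a fast thread from resetting its sum slot while a slower thread is still reading sum[0] to compute the mean. A simplified sketch of the hazard:

const float mean = sum[0]; // every thread reads the reduced sum
// missing: threadgroup_barrier(mem_flags::mem_threadgroup);
sum[tpitg] = 0.0f;         // thread 0 can clobber sum[0] before slower threads have read it

This would also explain why a threadgroup size of 32 behaves: on Apple GPUs, 32 threads fit in a single SIMD group that executes in lockstep, so the race cannot manifest, whereas 256 threads span multiple SIMD groups whose scheduling skew can expose it.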

Try changing the threadgroup size to 32 and observe that the numbers are back to normal (within numerical variation, of course):

--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1050,7 +1050,7 @@ void ggml_metal_graph_compute(
                             float eps;
                             memcpy(&eps, dst->op_params, sizeof(float));
 
-                            const int nth = 256;
+                            const int nth = 32;
 
                             [encoder setComputePipelineState:ctx->pipeline_norm];
                             [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.1853,[2]6.3691,[3]6.6080,[4]7.5007,[5]7.5353,[6]7.4923,[7]7.6464,[8]7.8036,[9]8.1274,[10]8.2545

@ggerganov (Member, Author) commented:

Merging this to fix Falcon on master, but please take a look and let me know if you can confirm my observations.

ggerganov merged commit c4f4966 into master on Sep 7, 2023
@ikawrakow (Contributor) commented:

In my case it is 100% reproducible:

iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null 
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,[10]8.1093,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,^C

In any case, the 256 did not come from me; I did not touch it at all.

@ggerganov (Member, Author) commented:

Yes, I was missing a memory barrier after computing the mean. Took me a lot of time to figure it out.
