metal : fix kernel_norm #3057

Merged
ggerganov merged 5 commits into master on Sep 7, 2023

Conversation

@ggerganov (Member) commented on Sep 7, 2023

After looking at the solution in #3060 I now understand where my mistake is: there is a missing barrier after computing the mean.
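For reference, here is a minimal sketch of what that kind of fix looks like (assuming the combined mean/variance loop; the exact hunk in this PR may differ slightly). The added barrier guarantees that every thread has read the reduced mean out of the shared sum buffer before the buffer is reset for the variance reduction:

--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ kernel void kernel_norm(
     const float mean = sum[0];
 
+    // ensure all threads have read sum[0] before the buffer is reused
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+
     // recenter and compute variance
     device float * y = dst + tgpig*ne00;
     sum[tpitg] = 0.0f;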

Ignore everything below


The norm kernel used by Falcon was broken in #2959

After fixing it, there still remains a very strange behaviour, which I have explained in the comments in the kernel. Basically, when the 2 loops for computing the mean and the variance are combined into one, inference breaks when that kernel runs with a threadgroup size of 256. If I tune the threadgroup size down to 128 or 32 threads, it works again.

I spent about an hour looking into this and I have no idea what is going on.
The only possible hint is that maybe we are somehow exceeding some threadgroup memory limit and Metal fails silently?


Seems unlikely, but I have no other guess. It would be great if someone else could also look into this; please let me know if you can confirm these observations:

First run the following command and note down the first few ppl numbers:

./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.1857,[2]6.3693,[3]6.6081,[4]7.5008,[5]7.5353,[6]7.4923,[7]7.6464,[8]7.8036,[9]8.1274,[10]8.2545

Now apply this patch and repeat:

--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -229,20 +229,9 @@ kernel void kernel_norm(
 
     // recenter
     device float * y = dst + tgpig*ne00;
-    for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
-        y[i00] = x[i00] - mean;
-    }
-
-    // VARIANCE
-    // parallel sum
-    //
-    // WARNING: combining this loop with the one above will give you wrong results for nth == 256
-    //          I have no idea why, so for now I am keeping them separate. But this behavior is very concerning.
-    //          Tested with:
-    //          ./perplexity -m ./falcon-7b/ggml-model-q4_0.gguf -f wiki.test.raw -ngl 1 -t 4
-    //
     sum[tpitg] = 0.0f;
     for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
+        y[i00] = x[i00] - mean;
         sum[tpitg] += y[i00] * y[i00];
     }
 
./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.2692,[2]6.4269,[3]6.6902,[4]7.5840,[5]7.6415,[6]7.6029,[7]7.7632,[8]7.9142,[9]8.2509,[10]8.3694
[1]5.2684,[2]6.4327,[3]6.6964,[4]7.5955,[5]7.6512,[6]7.6121,[7]7.7719,[8]7.9235,[9]8.2577,[10]8.3772

The numbers are not the same and differ from run to run.
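In hindsight (see the note at the top of this PR), the likely mechanism is not a memory limit but a data race on the threadgroup sum buffer: once the loops are combined, nothing stops a fast thread from resetting its sum slot while a slower thread is still reading sum[0] to compute the mean. A simplified sketch of the hazard:

const float mean = sum[0]; // every thread reads the reduced sum
// missing: threadgroup_barrier(mem_flags::mem_threadgroup);
sum[tpitg] = 0.0f;         // thread 0 can clobber sum[0] before slower threads have read it

This would also explain why a threadgroup size of 32 behaves: on Apple GPUs, 32 threads fit in a single SIMD group that executes in lockstep, so the race cannot manifest, whereas 256 threads span multiple SIMD groups whose scheduling skew can expose it.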

Try changing the threadgroup size to 32 and observe that the numbers are back to normal (within numerical variation, of course):

--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1050,7 +1050,7 @@ void ggml_metal_graph_compute(
                             float eps;
                             memcpy(&eps, dst->op_params, sizeof(float));
 
-                            const int nth = 256;
+                            const int nth = 32;
 
                             [encoder setComputePipelineState:ctx->pipeline_norm];
                             [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
./bin/perplexity -m ../models/falcon-7b/ggml-model-q4_0.gguf -f ../build/wikitext-2-raw/wiki.test.raw -ngl 1 -t 4

[1]5.1853,[2]6.3691,[3]6.6080,[4]7.5007,[5]7.5353,[6]7.4923,[7]7.6464,[8]7.8036,[9]8.1274,[10]8.2545

@ggerganov (Member, Author) commented:

Merging this to fix Falcon on master, but please take a look and let me know if you can confirm my observations.

ggerganov merged commit c4f4966 into master on Sep 7, 2023
@ikawrakow (Contributor) commented:

In my case it is 100% reproducible:

iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null 
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,[10]8.1093,^C
iwan@Iwans-MBP:~/other/llama.cpp/build$ ./bin/perplexity -m ../models/f7B/ggml-model-f16.gguf -f ../tests/wiki.test.raw -t 8 -ngl 100 2>/dev/null
[1]5.1179,[2]6.3082,[3]6.5855,[4]7.4463,[5]7.4582,[6]7.3921,[7]7.5325,[8]7.6739,[9]7.9901,^C

In any case, the 256 did not come from me; I did not touch it at all.

@ggerganov (Member, Author) commented:

Yes, I was missing a memory barrier after computing the mean. Took me a lot of time to figure it out.
