
Question: How to generate an MPS gputrace #6506

Open
tomsanbear opened this issue Apr 5, 2024 · 10 comments
Labels: help wanted, high priority

Comments


tomsanbear commented Apr 5, 2024

We're doing some work over at https://github.com/huggingface/candle to improve our Metal backend. I've been collecting gputraces for the various frameworks and was wondering whether there is a documented/known way to generate one for llama.cpp during model inference.

Specifically, I'm talking about this type of debugger output: https://developer.apple.com/documentation/xcode/metal-debugger

@ggerganov (Owner)

Unfortunately we don't have any docs. At some point I spent a considerable amount of time trying to learn Metal Debugger / Xcode Instruments in order to generate some useful information about the Metal performance, but I just got completely lost.

If someone who is more familiar with Metal is interested in contributing, writing some instructions on how to do profiling with these tools would be a very useful addition.

ggerganov added the good first issue and help wanted labels and removed good first issue on Apr 5, 2024
@tomsanbear (Author)

Thanks for the information. I'm familiar with this from the Rust side, so let me see if it's easy enough to port to this repository 👍


bitxsw93 commented Sep 4, 2024

Is there any way to see each Metal shader's time cost during inference now?

I'm lost on how to profile each shader. Can anyone provide a method? Thanks~

@tomsanbear (Author)

Hey @bitxsw93, I realize I forgot to post back here with the change required to dump timings:

Here is the change you can make to output a gputrace file to /tmp/llamacpp.gputrace; you can then open that trace file with Xcode to view the trace.

diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index 91b5e61b..7651cbd4 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -7,6 +7,7 @@

 #import <Metal/Metal.h>

+
 #undef MIN
 #undef MAX
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
@@ -452,7 +453,7 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
     GGML_METAL_LOG_INFO("%s: simdgroup matrix mul. support = %s\n",       __func__, ctx->support_simdgroup_mm ? "true" : "false");
     GGML_METAL_LOG_INFO("%s: hasUnifiedMemory              = %s\n",       __func__, ctx->device.hasUnifiedMemory ? "true" : "false");

-    ctx->should_capture_next_compute = false;
+    ctx->should_capture_next_compute = true;

 #if TARGET_OS_OSX || (TARGET_OS_IOS && __clang_major__ >= 15)
     if (@available(macOS 10.12, iOS 16.0, *)) {
@@ -891,6 +892,8 @@ static enum ggml_status ggml_metal_graph_compute(

         MTLCaptureDescriptor * descriptor = [MTLCaptureDescriptor new];
         descriptor.captureObject = ctx->queue;
+        descriptor.destination = MTLCaptureDestinationGPUTraceDocument;
+        descriptor.outputURL = [NSURL fileURLWithPath:[NSString stringWithFormat:@"/tmp/llamacpp.gputrace"]];

         NSError * error = nil;
         if (![[MTLCaptureManager sharedCaptureManager] startCaptureWithDescriptor:descriptor error:&error]) {
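
For reference, here is the capture lifecycle in isolation, a minimal sketch of Apple's documented MTLCaptureManager API (it assumes you already have a live command queue; in llama.cpp the matching stopCapture happens at the end of the graph compute):

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Minimal sketch of programmatic GPU-trace capture. `queue` is assumed
// to be a live MTLCommandQueue; METAL_CAPTURE_ENABLED=1 must be set in
// the environment at launch (see below), or startCapture will fail.
static void capture_to_gputrace(id<MTLCommandQueue> queue) {
    MTLCaptureManager * mgr = [MTLCaptureManager sharedCaptureManager];

    MTLCaptureDescriptor * desc = [MTLCaptureDescriptor new];
    desc.captureObject = queue;                                  // capture all work on this queue
    desc.destination   = MTLCaptureDestinationGPUTraceDocument;  // write a .gputrace document
    desc.outputURL     = [NSURL fileURLWithPath:@"/tmp/llamacpp.gputrace"];

    NSError * error = nil;
    if (![mgr startCaptureWithDescriptor:desc error:&error]) {
        NSLog(@"unable to start capture: %@", error);
        return;
    }

    // ... encode and commit the command buffers to be profiled ...

    [mgr stopCapture]; // finalize the trace so Xcode can open it
}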

tomsanbear (Author) commented Sep 5, 2024

@ggerganov it's a pretty small change to get these outputs during development; not sure if you think it's worth integrating with a more configurable setting somewhere to enable the dump.

@ggerganov (Owner)

I tried to apply the patch and see if it works, but I get the following error:

make -j && ./llama-cli -m ./models/tinyllama-1b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n 32

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =    44.00 MiB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      Metal compute buffer size =   148.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: error: unable to start capture 'Capturing is not supported.'
ggml/src/ggml-metal.m:900: capture failed
Abort trap: 6

Any ideas what could be wrong?

@tomsanbear (Author)

Ah right, there is a super secret, super fun environment variable you also need to set: METAL_CAPTURE_ENABLED=1

https://developer.apple.com/documentation/xcode/capturing-a-metal-workload-programmatically
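
With that set, the patched build writes the trace and you can open it in Xcode, e.g. (reusing the command from above; the output path comes from the patch):

METAL_CAPTURE_ENABLED=1 ./llama-cli -m ./models/tinyllama-1b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n 32
open /tmp/llamacpp.gputrace   # opens the trace in Xcode's Metal debugger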

@ggerganov (Owner)

Oh wow, this is super cool! This is very useful and it looks like something we can use to improve the Metal backend performance (some compute gaps between the encoders are immediately visible):

[screenshot: Xcode gputrace timeline; compute gaps between the encoders are visible]

> @ggerganov it's a pretty small change to get these outputs during development; not sure if you think it's worth integrating with a more configurable setting somewhere to enable the dump.

Sure, we should add some way to do this. Open to suggestions. Maybe an environment variable specifying the n-th compute call to trace (because we usually don't want to trace the warm-up compute loop).
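
A rough sketch of what that could look like inside ggml-metal.m (hypothetical: the GGML_METAL_CAPTURE_NTH variable and the compute_count field are invented names, not existing code):

#include <stdbool.h>
#include <stdlib.h> // getenv, atoi

// Hypothetical: only capture the n-th call to ggml_metal_graph_compute,
// so the warm-up compute loop is skipped. GGML_METAL_CAPTURE_NTH and
// ctx->compute_count are illustrative names, not part of the current code.
static bool ggml_metal_should_capture(struct ggml_metal_context * ctx) {
    const char * env = getenv("GGML_METAL_CAPTURE_NTH");
    if (env == NULL) {
        return false;
    }
    return ctx->compute_count++ == atoi(env); // true exactly once
}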

jmousseau (Contributor) commented Sep 16, 2024

> Sure, we should add some way to do this. Open to suggestions. Maybe an environment variable specifying the n-th compute call to trace (because we usually don't want to trace the warm-up compute loop).

Maybe rename the existing ggml_backend_metal_capture_next_compute to ggml_backend_metal_capture_compute and add int delay and char *capture_path parameters? There are some (admittedly brief) instructions in the original capture PR.
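
For concreteness, the renamed declaration might look something like this (a sketch; only ggml_backend_metal_capture_next_compute exists in ggml-metal.h today):

// Illustrative only: capture the compute call `delay` calls from now and
// write the resulting trace to `capture_path`.
GGML_API void ggml_backend_metal_capture_compute(
        ggml_backend_t backend,
        int            delay,
        const char   * capture_path);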

@ggerganov (Owner)

Yup, that's an option. But maybe to avoid changing the ggml-metal.h interface, the logic for when to call ggml_backend_metal_capture_next_compute can be implemented in llama.cpp.
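
For example, something along these lines (a sketch; LLAMA_METAL_CAPTURE_NTH is an invented variable, while ggml_backend_metal_capture_next_compute is the existing ggml-metal.h call):

#include <stdlib.h>       // getenv, atoi
#include "ggml-metal.h"   // ggml_backend_metal_capture_next_compute

// Sketch: arm the existing capture API from llama.cpp just before the
// n-th graph compute, leaving the ggml-metal.h interface unchanged.
static void llama_maybe_capture(ggml_backend_t backend, int n_compute) {
    const char * env = getenv("LLAMA_METAL_CAPTURE_NTH");
    if (env != NULL && n_compute == atoi(env)) {
        ggml_backend_metal_capture_next_compute(backend);
    }
}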
