
Question: How to generate an MPS gputrace #6506

Open
tomsanbear opened this issue Apr 5, 2024 · 10 comments
Labels: help wanted, high priority

Comments


tomsanbear commented Apr 5, 2024

We're doing some work over at https://github.com/huggingface/candle to improve our Metal backend. I've been collecting gputraces for the various frameworks and was wondering whether there is a documented/known way to generate one for llama.cpp during model inference.

Specifically, I'm talking about this type of debugger output: https://developer.apple.com/documentation/xcode/metal-debugger

@ggerganov (Owner)

Unfortunately we don't have any docs. At some point I spent a considerable amount of time trying to learn Metal Debugger / Xcode Instruments in order to generate some useful information about the Metal performance, but I just got completely lost.

If someone who is more familiar with Metal is interested in contributing, writing some instructions on how to do profiling with these tools would be a very useful addition.

ggerganov added the good first issue and help wanted labels and removed good first issue on Apr 5, 2024
@tomsanbear (Author)

Thanks for the information. I'm familiar with this from the Rust side, so let me see if it's easy enough to port to this repository 👍


bitxsw93 commented Sep 4, 2024

Is there any way to see each Metal shader's time cost during inference now?

I'm lost on how to profile each shader. Can anyone provide a method? Thanks~

@tomsanbear (Author)

Hey @bitxsw93, I realize I forgot to post back here with the change required to dump timings:

Here is the change you can make to output a gputrace file to /tmp/llamacpp.gputrace; you can then open that trace file with Xcode to view the trace.

diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index 91b5e61b..7651cbd4 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -7,6 +7,7 @@

 #import <Metal/Metal.h>

+
 #undef MIN
 #undef MAX
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
@@ -452,7 +453,7 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
     GGML_METAL_LOG_INFO("%s: simdgroup matrix mul. support = %s\n",       __func__, ctx->support_simdgroup_mm ? "true" : "false");
     GGML_METAL_LOG_INFO("%s: hasUnifiedMemory              = %s\n",       __func__, ctx->device.hasUnifiedMemory ? "true" : "false");

-    ctx->should_capture_next_compute = false;
+    ctx->should_capture_next_compute = true;

 #if TARGET_OS_OSX || (TARGET_OS_IOS && __clang_major__ >= 15)
     if (@available(macOS 10.12, iOS 16.0, *)) {
@@ -891,6 +892,8 @@ static enum ggml_status ggml_metal_graph_compute(

         MTLCaptureDescriptor * descriptor = [MTLCaptureDescriptor new];
         descriptor.captureObject = ctx->queue;
+        descriptor.destination = MTLCaptureDestinationGPUTraceDocument;
+        descriptor.outputURL = [NSURL fileURLWithPath:[NSString stringWithFormat:@"/tmp/llamacpp.gputrace"]];

         NSError * error = nil;
         if (![[MTLCaptureManager sharedCaptureManager] startCaptureWithDescriptor:descriptor error:&error]) {
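
For reference, here is the capture lifecycle in isolation, a minimal sketch of Apple's documented MTLCaptureManager API (it assumes you already have a live command queue; in llama.cpp the matching stopCapture happens at the end of the graph compute):

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Minimal sketch of programmatic GPU-trace capture. `queue` is assumed
// to be a live MTLCommandQueue; METAL_CAPTURE_ENABLED=1 must be set in
// the environment at launch (see below), or startCapture will fail.
static void capture_to_gputrace(id<MTLCommandQueue> queue) {
    MTLCaptureManager * mgr = [MTLCaptureManager sharedCaptureManager];

    MTLCaptureDescriptor * desc = [MTLCaptureDescriptor new];
    desc.captureObject = queue;                                  // capture all work on this queue
    desc.destination   = MTLCaptureDestinationGPUTraceDocument;  // write a .gputrace document
    desc.outputURL     = [NSURL fileURLWithPath:@"/tmp/llamacpp.gputrace"];

    NSError * error = nil;
    if (![mgr startCaptureWithDescriptor:desc error:&error]) {
        NSLog(@"unable to start capture: %@", error);
        return;
    }

    // ... encode and commit the command buffers to be profiled ...

    [mgr stopCapture]; // finalize the trace so Xcode can open it
}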

tomsanbear (Author) commented Sep 5, 2024

@ggerganov it's a pretty small change to get these outputs during development; not sure if you think it's worth integrating with a more configurable setting somewhere to enable the dump.

@ggerganov (Owner)

I tried to apply the patch and see if it works, but I get the following error:

make -j && ./llama-cli -m ./models/tinyllama-1b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n 32

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =    44.00 MiB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      Metal compute buffer size =   148.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: error: unable to start capture 'Capturing is not supported.'
ggml/src/ggml-metal.m:900: capture failed
Abort trap: 6

Any ideas what could be wrong?

@tomsanbear (Author)

Ah right, there is a super secret, super fun environment variable you also need to set: METAL_CAPTURE_ENABLED=1

https://developer.apple.com/documentation/xcode/capturing-a-metal-workload-programmatically
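
With that set, the patched build writes the trace and you can open it in Xcode, e.g. (reusing the command from above; the output path comes from the patch):

METAL_CAPTURE_ENABLED=1 ./llama-cli -m ./models/tinyllama-1b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n 32
open /tmp/llamacpp.gputrace   # opens the trace in Xcode's Metal debugger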

@ggerganov (Owner)

Oh wow, this is super cool! This is very useful and it looks like something we can use to improve the Metal backend performance (some compute gaps between the encoders are immediately visible):

[screenshot: Xcode gputrace timeline; compute gaps between the encoders are visible]

> @ggerganov it's a pretty small change to get these outputs during development; not sure if you think it's worth integrating with a more configurable setting somewhere to enable the dump.

Sure, we should add some way to do this. Open to suggestions. Maybe an environment variable specifying the n-th compute call to trace (because we usually don't want to trace the warm-up compute loop).
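
A rough sketch of what that could look like inside ggml-metal.m (hypothetical: the GGML_METAL_CAPTURE_NTH variable and the compute_count field are invented names, not existing code):

#include <stdbool.h>
#include <stdlib.h> // getenv, atoi

// Hypothetical: only capture the n-th call to ggml_metal_graph_compute,
// so the warm-up compute loop is skipped. GGML_METAL_CAPTURE_NTH and
// ctx->compute_count are illustrative names, not part of the current code.
static bool ggml_metal_should_capture(struct ggml_metal_context * ctx) {
    const char * env = getenv("GGML_METAL_CAPTURE_NTH");
    if (env == NULL) {
        return false;
    }
    return ctx->compute_count++ == atoi(env); // true exactly once
}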

jmousseau (Contributor) commented Sep 16, 2024

> Sure, we should add some way to do this. Open to suggestions. Maybe an environment variable specifying the n-th compute call to trace (because we usually don't want to trace the warm-up compute loop).

Maybe rename the existing ggml_backend_metal_capture_next_compute to ggml_backend_metal_capture_compute and add int delay and char *capture_path parameters? There are some (admittedly brief) instructions in the original capture PR.
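
For concreteness, the renamed declaration might look something like this (a sketch; only ggml_backend_metal_capture_next_compute exists in ggml-metal.h today):

// Illustrative only: capture the compute call `delay` calls from now and
// write the resulting trace to `capture_path`.
GGML_API void ggml_backend_metal_capture_compute(
        ggml_backend_t backend,
        int            delay,
        const char   * capture_path);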

@ggerganov (Owner)

Yup, that's an option. But maybe to avoid changing the ggml-metal.h interface, the logic for when to call ggml_backend_metal_capture_next_compute can be implemented in llama.cpp.
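
For example, something along these lines (a sketch; LLAMA_METAL_CAPTURE_NTH is an invented variable, while ggml_backend_metal_capture_next_compute is the existing ggml-metal.h call):

#include <stdlib.h>       // getenv, atoi
#include "ggml-metal.h"   // ggml_backend_metal_capture_next_compute

// Sketch: arm the existing capture API from llama.cpp just before the
// n-th graph compute, leaving the ggml-metal.h interface unchanged.
static void llama_maybe_capture(ggml_backend_t backend, int n_compute) {
    const char * env = getenv("LLAMA_METAL_CAPTURE_NTH");
    if (env != NULL && n_compute == atoi(env)) {
        ggml_backend_metal_capture_next_compute(backend);
    }
}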
