@lovelyoverflow

Motivation

While analyzing MLX performance on my Mac Studio (M4 Max), I realized that visualizing GPU execution patterns is critical for understanding where optimization effort pays off. As far as I can tell, there is currently no documentation on how to use Xcode Instruments with MLX.

Changes

  • Added a new guide: guides/profiling_with_instruments.md
  • Explained how to identify "CPU Dispatch Overhead" vs "Fused Kernels" using Metal System Trace.
  • Included a sample code snippet for profiling.
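For reviewers who want a feel for the kind of workload the guide profiles, here is a minimal sketch of an eager vs compiled comparison. It is not the snippet from the guide itself: `elementwise_chain` and `time_call` are hypothetical helper names, and the array size and iteration count are arbitrary assumptions. It relies only on `mx.compile`, `mx.eval`, and `mx.random.normal` from `mlx.core`.

```python
import time

try:
    import mlx.core as mx
except ImportError:  # lets the sketch load on machines without MLX
    mx = None


def elementwise_chain(x):
    # A chain of element-wise ops. In eager mode each op is typically
    # dispatched separately; mx.compile can fuse them into fewer kernels.
    return mx.tanh(x * 0.5 + 1.0) * mx.exp(-x * x)


def time_call(fn, arg, sync, iters=50):
    # Average wall-clock seconds per call; `sync` forces evaluation of the
    # lazy result so the timing covers the actual computation.
    sync(fn(arg))  # warm-up (also triggers compilation for compiled fns)
    start = time.perf_counter()
    for _ in range(iters):
        sync(fn(arg))
    return (time.perf_counter() - start) / iters


def main():
    if mx is None:
        print("MLX is not installed; nothing to benchmark.")
        return
    x = mx.random.normal(shape=(4096, 4096))  # size is an arbitrary choice
    mx.eval(x)
    eager = time_call(elementwise_chain, x, mx.eval)
    fused = time_call(mx.compile(elementwise_chain), x, mx.eval)
    print(f"eager:    {eager * 1e3:.3f} ms/iter")
    print(f"compiled: {fused * 1e3:.3f} ms/iter")


if __name__ == "__main__":
    main()
```

Running a script like this under Metal System Trace should make the difference between the two patterns visible on the GPU timeline; the actual timings depend on the machine and the MLX version.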

Context

I am a student aiming to become an inference optimization engineer. In my experiments, MLX's kernel fusion substantially reduced memory bandwidth pressure compared with PyTorch on unified memory architectures. I hope this guide helps other developers optimize their models.

Thank you for your hard work on this amazing framework!

This guide demonstrates how to use Metal System Trace to identify
CPU dispatch overhead vs fused kernels, helping developers optimize
MLX models on Apple Silicon.

Includes:
- Step-by-step profiling workflow using xctrace CLI
- Benchmark script demonstrating kernel fusion benefits
- Visual comparison of eager vs compiled execution patterns
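As a rough illustration of the xctrace step, the sketch below assembles a `xctrace record` invocation with the Metal System Trace template. The helper name `build_xctrace_command`, the script name `benchmark.py`, and the output file name are assumptions for illustration and may not match the guide. The command is only built and printed here, since recording requires macOS with Xcode installed.

```python
import shlex
import sys


def build_xctrace_command(script, output="mlx_profile.trace",
                          template="Metal System Trace",
                          interpreter=sys.executable):
    # Returns the argument list for recording a trace of `script`.
    # Everything after "--" is the command that xctrace launches.
    return [
        "xcrun", "xctrace", "record",
        "--template", template,
        "--output", output,
        "--launch", "--", interpreter, script,
    ]


# Print a copy-pasteable shell command for the (hypothetical) benchmark script.
print(shlex.join(build_xctrace_command("benchmark.py")))
```

On a Mac, the list can be passed to `subprocess.run(cmd, check=True)`, and the resulting `.trace` bundle can then be opened in Instruments to inspect the GPU timeline.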