
Conversation

@kimminsu38oo
Contributor

This PR adds operator-level profiling to the ggml-cpu backend.

Key Changes

  • Compile Option: Added GGML_CPU_OP_PROFILING to enable this feature.

  • Output: Saves operator execution times (in ms) to op_profiling.csv.

  • Thread Safety: Implemented synchronization barriers to ensure accurate timing in multi-threaded environments.

Performance

  • Negligible runtime overhead.

Example Output
[screenshot: example op_profiling.csv contents]
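Since the diff isn't shown in this thread, here is a minimal, self-contained sketch of the barrier-based timing idea, using plain pthreads rather than ggml's internal threadpool; all names are illustrative and this is not the actual patch:

```c
/*
 * Minimal, self-contained illustration of barrier-based per-op timing
 * (plain pthreads; ggml internals omitted -- this is NOT the actual patch).
 * Build: cc -O2 -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N_THREADS 4

static pthread_barrier_t barrier;
static double t_start_ms;   // written by thread 0 only, between barriers

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void fake_op(int ith) {
    (void) ith;             // each thread would compute its slice here
}

static void * worker(void * arg) {
    const int ith = (int) (long) arg;

    pthread_barrier_wait(&barrier);   // align all threads before timing
    if (ith == 0) {
        t_start_ms = now_ms();
    }

    fake_op(ith);                     // the operator being profiled

    pthread_barrier_wait(&barrier);   // wait for the slowest thread
    if (ith == 0) {
        // in the real backend this row would go to op_profiling.csv
        printf("FAKE_OP,%.3f\n", now_ms() - t_start_ms);
    }
    return NULL;
}

int main(void) {
    pthread_t th[N_THREADS];
    pthread_barrier_init(&barrier, NULL, N_THREADS);
    for (long i = 0; i < N_THREADS; i++) {
        pthread_create(&th[i], NULL, worker, (void *) i);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(th[i], NULL);
    }
    pthread_barrier_destroy(&barrier);
    return 0;
}
```

The second barrier is the important design choice: the recorded duration reflects the slowest thread, which is what actually gates graph execution.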

@am17an
Collaborator

am17an commented Dec 1, 2025

Have you tested the effect on overall runtime? Hard to believe that writing to a file and flushing after every op completion has "negligible run-time overhead".

@kimminsu38oo
Contributor Author

kimminsu38oo commented Dec 1, 2025

@am17an

Thanks for the feedback. I just ran a benchmark to verify the impact.

Test Env: Intel Xeon E-2388G (8 threads), Prefill/Decode 256 tokens.

Model: LLaMA3.2-3B_Q4_0
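For reference, a run along these lines should reproduce this setup (model filename is illustrative, and the profiling build assumes the new compile option is enabled at configure time):

```sh
# hypothetical invocation matching the setup above
./llama-bench -m llama-3.2-3b-q4_0.gguf -p 256 -n 256 -t 8
```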

Results

  • Original: [benchmark screenshot]
  • Profiling on: [benchmark screenshot]

I observed a latency increase of about 78 ms for prefill and 742 ms for decode.

You were right that the overhead isn't negligible. That said, relative to the total end-to-end runtime, it may still be acceptable.

@kimminsu38oo
Contributor Author

kimminsu38oo commented Dec 1, 2025

@am17an
I also tested on a mobile device.

Test Env: Galaxy S24 Ultra (Snapdragon 8 Gen 3, 6 threads), Prefill/Decode 256 tokens.

Model: LLaMA3.2-3B_Q4_0

Results

  • Original: [benchmark screenshot]
  • Profiling on: [benchmark screenshot]

Counterintuitively, the profiling overhead was smaller on the mobile device, even with its more constrained memory bandwidth.

I observed a difference of 61ms for prefill and 217ms for decode.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 1, 2025
@am17an
Copy link
Collaborator

am17an commented Dec 2, 2025

You can see per-function times far better using a proper profiler (like Intel VTune or AMD uProf; for GPUs there is Nsight). Adding an ad-hoc CSV file does not make sense; we already have test-backend-ops, which measures the performance of individual operations in a much more statistically sound way. As such, this change does not make sense.
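For reference, test-backend-ops can already benchmark a single operator on a chosen backend, e.g.:

```sh
# perf mode, filtered to one op on the CPU backend
./test-backend-ops perf -o MUL_MAT -b CPU
```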

@kimminsu38oo
Contributor Author

kimminsu38oo commented Dec 2, 2025

@am17an

Thanks for the feedback, and thank you for taking an interest in this.

As you mentioned, profiling on desktop is possible with tools like VTune.
However, when I initially wrote this code, I wanted an operator-level breakdown on mobile, and profiling in that environment turned out to be quite tricky (possibly due to my own inexperience: Android Profiler didn't work for me, and other profilers required a rooted device).

Also, ggml-opencl provides operator-level profiling, and this motivated me to write the corresponding code as well.
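For context, that OpenCL profiling is built on standard cl_event timestamps; the general pattern (generic names, not the exact ggml-opencl code) looks like this:

```c
#include <CL/cl.h>

// Times one kernel launch via OpenCL event profiling. The queue must be
// created with CL_QUEUE_PROFILING_ENABLE; names here are generic.
static double run_and_time_ms(cl_command_queue queue, cl_kernel kernel,
                              size_t global_size) {
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t0 = 0, t1 = 0;   // device timestamps, in nanoseconds
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    clReleaseEvent(evt);
    return (t1 - t0) / 1e6;
}
```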

But upon reflection, I realize that such an implementation could clutter the codebase of llama.cpp, which supports multiple backends (such as CUDA and Intel CPU). Thank you for your feedback.
