ggml-cpu: Add operator-level execution time profiling #17657
base: master
Conversation
|
Have you tested the effect on overall runtime? Hard to believe that writing to a file and flushing after every op completion has "negligible run-time overhead" |
|
@am17an Test env: Galaxy S24 Ultra (Snapdragon 8 Gen 3), 6 threads, prefill/decode of 256 tokens. Model: LLaMA3.2-3B_Q4_0. Results:
Counterintuitively, the profiling overhead was smaller on the mobile device, even with its more constrained memory bandwidth. I observed a difference of 61 ms for prefill and 217 ms for decode. |
|
You can see per-function times far better using a proper profiler (like Intel VTune or AMD uProf, or Nsight for GPU). Adding an ad-hoc CSV file does not make sense; we already have |
|
Thanks for the feedback, and thank you for taking an interest in this. As you mentioned, profiling on desktop is possible using tools like VTune. ggml-opencl also provides operator-level profiling, which motivated me to write the corresponding code for the CPU backend. On reflection, though, I realize that such an implementation could clutter the codebase in llama.cpp, which supports multiple backends (such as CUDA and Intel CPU). |




This PR adds operator-level profiling to the ggml-cpu backend.
Key Changes
Compile Option: Added GGML_CPU_OP_PROFILING to enable this feature.
Output: Saves operator execution times in ms to op_profiling.csv.
Thread Safety: Implemented synchronization barriers to ensure accurate timing in multi-threaded environments.
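The key changes above can be illustrated with a minimal, single-threaded sketch. This is not the PR's actual code: `run_op_profiled`, `now_ms`, `demo_op`, and the `name,ms` row layout are all hypothetical names and assumptions chosen for illustration. Only the GGML_CPU_OP_PROFILING flag and the idea of writing per-operator times in ms to a CSV file come from the description. The cross-thread synchronization barrier is omitted here, and the sketch does not flush after each row.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Normally supplied by the build system; defined here so the sketch compiles. */
#ifndef GGML_CPU_OP_PROFILING
#define GGML_CPU_OP_PROFILING
#endif

/* Hypothetical operator signature; real ggml ops take compute params and a tensor. */
typedef void (*op_fn)(void * ctx);

/* Wall-clock time in milliseconds via C11 timespec_get.
 * A real implementation would likely prefer a monotonic clock. */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double) ts.tv_sec * 1e3 + (double) ts.tv_nsec / 1e6;
}

/* Stand-in operator used only to exercise the wrapper: counts its invocations. */
static void demo_op(void * ctx) {
    int * calls = (int *) ctx;
    (*calls)++;
}

/* Runs one operator. When profiling is compiled in, appends a "name,ms" row
 * to csv (assumed layout) and returns the elapsed time in ms.
 * When profiling is compiled out, only the operator runs and 0.0 is returned. */
static double run_op_profiled(FILE * csv, const char * name, op_fn fn, void * ctx) {
    assert(name != NULL && fn != NULL);
#ifdef GGML_CPU_OP_PROFILING
    const double t0 = now_ms();
    fn(ctx);
    const double dt = now_ms() - t0;
    if (csv != NULL) {
        /* Not flushed per row: buffered output keeps file I/O off the hot path. */
        fprintf(csv, "%s,%.3f\n", name, dt);
    }
    return dt;
#else
    (void) csv;
    fn(ctx);
    return 0.0;
#endif
}
```

In a multi-threaded backend, every worker thread would have to reach a barrier before the start timestamp and again before the end timestamp, so that the recorded interval covers the whole operator and not just one thread's share of it. That extra synchronization is itself a source of overhead, which is worth measuring separately from the file I/O.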
Performance
Negligible runtime overhead.
Example Output
