Operator-level microbenchmarking #3154

Open
wants to merge 1 commit into base: main

Conversation

@SSYernar (Contributor) commented on Jul 2, 2025

Summary:
This change introduces microbenchmarking for individual PyTorch operators. Since we need to capture and measure each operator call (which happens under the hood in PyTorch), we use `torch.profiler.profile`; example operators are `aten::mm`, `aten::sigmoid`, `cudaLaunchKernel`, etc. (a minimal sketch follows the flag list below).
Use `--benchmark_operators` to enable operator-level benchmarking.
Use the `--limit_operator_results` argument to specify how many of the highest-runtime operators to report.
Use the `--target_operators` argument to list the PyTorch operators to benchmark.
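
For context, here is a minimal sketch of collecting and ranking per-operator runtimes with `torch.profiler.profile`. It is not this PR's implementation: the workload (`model`, `batch`) and the limit value are placeholders standing in for the pipeline being benchmarked.
```
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; the PR profiles a real train pipeline instead.
model = torch.nn.Linear(512, 512)
batch = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(batch).sigmoid().sum().backward()

# key_averages() aggregates profiler events by operator name
# (e.g. aten::addmm, aten::sigmoid). Sorting by total CPU time and
# truncating mirrors the idea behind --limit_operator_results.
limit = 5
top_ops = sorted(prof.key_averages(), key=lambda e: e.cpu_time_total, reverse=True)
for evt in top_ops[:limit]:
    # cpu_time_total is reported in microseconds.
    print(f"{evt.key:<30} {evt.cpu_time_total / 1000.0:8.2f} ms")
```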

Example output:
```
TrainPipelineSparseDist             | Malloc retries (P50/P90/P100): 0.0 / 0.0 / 0.0 | Runtime (P90): 442.08 ms | Peak Memory alloc (P90): 24.23 GB | Peak Memory reserved (P90): 26.21 GB
operator_aten::copy_                | Malloc retries (P50/P90/P100): -1.0 / -1.0 / -1.0 | Runtime (P90): 39.21 ms | Peak Memory alloc (P90): 0.00 GB | Peak Memory reserved (P90): -0.00 GB
...
```
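
The P50/P90/P100 columns above are percentiles, presumably aggregated over the benchmark iterations. As a minimal sketch (the runtimes below are made up), such figures can be computed with `torch.quantile`:
```
import torch

# Hypothetical per-iteration runtimes (ms) for a single operator.
runtimes = torch.tensor([38.7, 39.0, 39.2, 39.5, 41.1])
p50, p90, p100 = (torch.quantile(runtimes, q).item() for q in (0.5, 0.9, 1.0))
print(f"Runtime (P50/P90/P100): {p50:.2f} / {p90:.2f} / {p100:.2f} ms")
```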

Differential Revision: D77676673
@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 2, 2025
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D77676673

Labels: CLA Signed, fb-exported