Change how benchmarking code works #1

mag- · 2024-08-10T06:53:48Z

mag-
Aug 10, 2024
Maintainer

I think there are a few potential improvements we can make to the original benchmarking code:

    for i in range(num_warmup_iterations + num_iterations):
        with torch.no_grad():
            start.record()
            torch.mm(A, B, out=C)
            end.record()
        arch.synchronize()
        times[i] = start.elapsed_time(end)
    times = times[num_warmup_iterations:]
    elapsed_time = np.amin(times)/1000 # want the fastest

Suggested improvements:

1. Instead of reporting only the minimum time, we should calculate and report more comprehensive statistics: min, max, mean, and standard deviation. This gives a more complete picture of the performance characteristics.
2. The synchronization should be handled more carefully. To be honest, I think it should be inside the loop, so that each iteration has finished before its elapsed time is read. The PyTorch documentation recommends using their torch.utils.benchmark module (its Timer is modeled on Python's timeit) for accurate CUDA timing: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
   More documentation here: https://pytorch.org/docs/stable/benchmark_utils.html
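
To make both suggestions concrete, here is a rough sketch, not a drop-in patch. The helper names `summarize_times` and `time_mm_with_torch_benchmark`, the matrix size, and the `min_run_time` default are all made up for illustration. The first helper computes the statistics from the per-iteration millisecond timings the existing loop already collects. The second shows how the same matmul might be timed with `torch.utils.benchmark.Timer`, which takes care of warmup and CUDA synchronization itself.

```python
import statistics


def summarize_times(times_ms, num_warmup_iterations=0):
    """Return min/max/mean/std in seconds for per-iteration times given in ms.

    Hypothetical helper: the first `num_warmup_iterations` entries are dropped,
    mirroring the slicing done in the original loop.
    """
    kept = [t / 1000 for t in times_ms[num_warmup_iterations:]]
    if not kept:
        raise ValueError("no timings left after discarding warmup iterations")
    return {
        "min": min(kept),
        "max": max(kept),
        "mean": statistics.fmean(kept),
        # Sample standard deviation (n - 1 denominator); note that np.std
        # defaults to the population form, so the two differ slightly.
        "std": statistics.stdev(kept) if len(kept) > 1 else 0.0,
    }


def time_mm_with_torch_benchmark(n=1024, min_run_time=1.0):
    """Time torch.mm with torch.utils.benchmark (sizes here are assumptions)."""
    # Imported lazily so the statistics helper works without PyTorch installed.
    import torch
    from torch.utils import benchmark

    device = "cuda" if torch.cuda.is_available() else "cpu"
    A = torch.randn(n, n, device=device)
    B = torch.randn(n, n, device=device)
    C = torch.empty(n, n, device=device)

    timer = benchmark.Timer(
        stmt="torch.mm(A, B, out=C)",
        globals={"torch": torch, "A": A, "B": B, "C": C},
    )
    # blocked_autorange collects several measurements; .times is in seconds.
    measurement = timer.blocked_autorange(min_run_time=min_run_time)
    return summarize_times([t * 1000 for t in measurement.times])
```

With the existing loop, something like `summarize_times(times, num_warmup_iterations)` could replace the final `np.amin` line, and the minimum would still be available as the `"min"` entry.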
