frost-intel (Contributor)
Based on the NMS updates in pytorch/vision#8766, this PR moves the gather-keep section of the nms op from CPU to XPU. This causes a very minor slowdown for small inputs (num_boxes < 400) but drastically increases performance for large inputs by eliminating data transfer between XPU and CPU. Since the number of boxes in typical workloads is > 1000, this is a reasonable trade-off.
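For context, the gather-keep step can be sketched roughly as follows. This is a simplified illustration, not the actual XPU kernel: the `gather_keep` helper and the dense suppression-mask layout are hypothetical (the real kernel packs the mask into bit blocks), but it shows why keeping the selection on-device avoids the mask transfer to the CPU.

```python
import torch

def gather_keep(sup: torch.Tensor) -> torch.Tensor:
    """Sequential keep-selection over a precomputed suppression mask.

    `sup[i, j]` is True when box i (higher score) suppresses box j,
    with boxes assumed sorted by descending score. Hypothetical sketch:
    all tensor ops run on `sup.device`, so the mask never has to be
    copied back to the CPU.
    """
    n = sup.size(0)
    keep = torch.ones(n, dtype=torch.bool, device=sup.device)
    for i in range(n):
        if keep[i]:  # box i survives, so drop everything it suppresses
            keep[i + 1:] &= ~sup[i, i + 1:]
    return keep.nonzero().squeeze(1)

# Toy example: box 0 suppresses box 1; box 1 would suppress box 2,
# but box 1 is already gone, so box 2 survives.
sup = torch.tensor([[False, True,  False],
                    [False, False, True],
                    [False, False, False]])
print(gather_keep(sup).tolist())  # → [0, 2]
```

Before this change the equivalent loop ran on the CPU, which required transferring the mask off the device; that transfer is what dominates at large num_boxes.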

Details

XPU with this PR (median ± std, ms):

| num_boxes | `_batched_nms_coordinate_trick` | `_batched_nms_vanilla` |
|---:|---:|---:|
| 10 | 0.60 ± 0.10 | 2.73 ± 0.04 |
| 100 | 0.60 ± 0.03 | 2.75 ± 0.07 |
| 200 | 0.60 ± 0.03 | 2.77 ± 0.05 |
| 400 | 0.61 ± 0.03 | 2.80 ± 0.03 |
| 800 | 0.61 ± 0.02 | 2.81 ± 0.03 |
| 1000 | 0.62 ± 0.01 | 2.15 ± 0.12 |
| 2000 | 0.54 ± 0.01 | 2.15 ± 0.01 |
| 10000 | 1.76 ± 0.02 | 3.25 ± 0.02 |
| 20000 | 2.83 ± 0.03 | 4.74 ± 0.02 |
| 80000 | 17.79 ± 0.05 | 12.27 ± 0.03 |
| 100000 | 25.76 ± 0.04 | 15.43 ± 0.04 |
| 200000 | 85.42 ± 0.26 | 36.35 ± 0.04 |

XPU on main (median ± std, ms):

| num_boxes | `_batched_nms_coordinate_trick` | `_batched_nms_vanilla` |
|---:|---:|---:|
| 10 | 0.47 ± 0.08 | 2.35 ± 0.07 |
| 100 | 0.59 ± 0.03 | 2.40 ± 0.09 |
| 200 | 0.60 ± 0.04 | 2.46 ± 0.06 |
| 400 | 0.60 ± 0.03 | 2.98 ± 0.03 |
| 800 | 0.61 ± 0.01 | 2.98 ± 0.02 |
| 1000 | 0.62 ± 0.01 | 3.01 ± 0.02 |
| 2000 | 0.66 ± 0.01 | 3.34 ± 0.02 |
| 10000 | 3.82 ± 3.67 | 5.31 ± 1.82 |
| 20000 | 20.92 ± 1.70 | 7.22 ± 1.43 |
| 80000 | 119.85 ± 5.65 | 90.21 ± 3.99 |
| 100000 | 168.14 ± 4.02 | 123.07 ± 1.49 |
| 200000 | 457.85 ± 70.04 | 254.54 ± 5.27 |
Benchmark script:

```python
import torch
from time import perf_counter_ns
from torchvision.ops.boxes import _batched_nms_coordinate_trick, _batched_nms_vanilla

def bench(f, *args, num_exp=1000, warmup=0, **kwargs):
    for _ in range(warmup):
        f(*args, **kwargs)
    torch.xpu.synchronize()  # drain warmup work before timing

    times = []
    for _ in range(num_exp):
        start = perf_counter_ns()
        f(*args, **kwargs)
        torch.xpu.synchronize()  # include kernel completion in the measurement
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()

def report_stats(times, unit="ms", prefix=""):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{prefix}{med = :.2f}{unit} +- {std:.2f}")
    return med


def make_boxes(num_boxes, num_classes=4, device="xpu"):
    # x1, y1 in [0, 1) and x2, y2 in [10, 11), so every box is well-formed
    boxes = torch.cat((torch.rand(num_boxes, 2), torch.rand(num_boxes, 2) + 10), dim=1).to(device)
    assert boxes[:, 0].max() < boxes[:, 2].min()  # x1 < x2
    assert boxes[:, 1].max() < boxes[:, 3].min()  # y1 < y2

    scores = torch.rand(num_boxes).to(device)
    idxs = torch.randint(0, num_classes, size=(num_boxes,)).to(device)
    return boxes, scores, idxs

NUM_EXP = 30
for num_boxes in (10, 100, 200, 400, 600, 800, 1000, 1400, 2000, 10_000, 20_000, 80_000, 100_000, 200_000):
    for f in (_batched_nms_coordinate_trick, _batched_nms_vanilla):
        boxes, scores, idxs = make_boxes(num_boxes)
        times = bench(f, boxes, scores, idxs, iou_threshold=0.7, warmup=1, num_exp=NUM_EXP)
        report_stats(times, prefix=f"{num_boxes = } ")
```
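A quick back-of-the-envelope comparison of the `_batched_nms_coordinate_trick` medians reported above (values copied from the tables; this is just arithmetic on the published numbers, not a new measurement):

```python
# Median timings (ms) for _batched_nms_coordinate_trick, from the tables above.
main_ms = {10: 0.47, 2000: 0.66, 200_000: 457.85}
new_ms = {10: 0.60, 2000: 0.54, 200_000: 85.42}

for n in main_ms:
    speedup = main_ms[n] / new_ms[n]
    print(f"num_boxes = {n}: {speedup:.2f}x")
# num_boxes = 10: 0.78x
# num_boxes = 2000: 1.22x
# num_boxes = 200000: 5.36x
```

This matches the trade-off stated in the description: a mild regression at tiny inputs, roughly break-even around a few thousand boxes, and a large win at scale.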

@frost-intel requested a review from Stonepia on March 4, 2025, and added this pull request to the merge queue on March 5, 2025.
Merged via the queue into main with commit b8c05de on March 5, 2025; 8 of 9 checks passed.
The frost/nms_xpu_only branch was deleted on March 5, 2025.