
Verify the usefulness of the GPU Utilization metric compared to SM Efficiency #505

Open
@isaac091

Description


This article lays out how GPU Utilization is actually measured and shows that the reported utilization can be very high even when most of the GPU's compute capacity is going unused. For example, the author shares that in some of their initial testing, their models were reaching "100% utilization" while only achieving 20% Model FLOPS Utilization, i.e. 20% of the maximum theoretical FLOPS (Floating Point Operations per Second).
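To make the 20% figure concrete, here is a rough sketch of how a FLOPS-based utilization number can be computed. It assumes the common estimate of about 6 FLOPs per parameter per token for transformer training; the function name and all of the numbers are hypothetical and only chosen so the result lands on 20%.

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """Ratio of achieved to peak FLOPS.

    Assumption: training costs roughly 6 FLOPs per parameter per token
    (forward + backward pass), which is a common estimate for transformers.
    """
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical values, not measurements from our runs.
mfu = model_flops_utilization(
    n_params=1.0e9,                 # 1B-parameter model
    tokens_per_second=10_400,       # observed training throughput
    peak_flops_per_second=312e12,   # assumed peak for the GPU and dtype in use
)
print(f"MFU: {mfu:.1%}")  # prints "MFU: 20.0%"
```

A run like this could still show 100% in the GPU Utilization column, which is the gap the article is pointing at.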

The article recommends looking at a metric called SM Efficiency (SM for streaming multiprocessor; the metric is also called SM Activity), which reports the percentage of SMs that are active. A discrepancy between these two metrics can indicate a less visible bottleneck that "fused kernels" may help with. Using Flash Attention or SDPA is one example of this, and according to the article, similar implementations for other types of layers are readily available. I didn't look into these alternatives too much, so it's possible that we're already using more than one of them for their general benefits.
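A toy model may help show why the two metrics can disagree. It assumes that GPU Utilization counts a sample as "busy" whenever any kernel is running, while SM Efficiency averages the fraction of SMs that are active. The sample data and the `summarize` helper are hypothetical, and this is a simplification of what the real counters do.

```python
def summarize(active_sms_per_sample, total_sms):
    """Return (utilization, sm_activity) for a list of per-sample active SM counts.

    Toy assumptions:
    - utilization: fraction of samples in which at least one SM was active
    - sm_activity: average fraction of SMs that were active per sample
    """
    n = len(active_sms_per_sample)
    busy_samples = sum(1 for active in active_sms_per_sample if active > 0)
    utilization = busy_samples / n
    sm_activity = sum(active_sms_per_sample) / (n * total_sms)
    return utilization, sm_activity


# Hypothetical trace: a small kernel keeps 22 of 108 SMs busy in every sample.
total_sms = 108
samples = [22] * 10
utilization, sm_activity = summarize(samples, total_sms)
print(f"GPU utilization: {utilization:.0%}, SM activity: {sm_activity:.0%}")
```

In this trace the GPU is never idle, so utilization reads 100%, but only about a fifth of the SMs are doing work at any time.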

If nothing else, it may be useful to add SM Efficiency to our standard set of metrics logged on ClearML. The metric is available through the NVIDIA Data Center GPU Manager (DCGM), and it is also available on demand through nvidia-smi dmon.
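As a starting point, a sketch like the following could sample SM activity through DCGM's command-line tool. It assumes `dcgmi` is installed, that profiling field ID 1002 corresponds to SM activity, and that the output has one `GPU <id> <value>` row per device; the `SAMPLE` text is an assumed layout, not captured output, and both function names are hypothetical.

```python
import subprocess


def parse_dcgmi_dmon(text):
    """Map GPU id -> SM activity from `dcgmi dmon -e 1002` style output.

    Assumes data rows look like "GPU 0   0.203"; header rows and
    unparseable values such as "N/A" are skipped. If a GPU appears in
    several rows, the last reading wins.
    """
    readings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] != "GPU":
            continue
        try:
            readings[int(parts[1])] = float(parts[2])
        except ValueError:
            continue
    return readings


def sample_sm_activity(count=1):
    """Run dcgmi for `count` samples and parse the result (requires DCGM)."""
    result = subprocess.run(
        ["dcgmi", "dmon", "-e", "1002", "-c", str(count)],
        capture_output=True, text=True, check=True,
    )
    return parse_dcgmi_dmon(result.stdout)


# Assumed output layout, used here so the parser can be tried without a GPU.
SAMPLE = """#Entity   SMACT
ID
GPU 0     0.203
GPU 1     N/A
"""

for gpu_id, value in parse_dcgmi_dmon(SAMPLE).items():
    print(f"gpu{gpu_id} sm_activity={value}")
```

The resulting values could then be passed to ClearML, for example through `Logger.current_logger().report_scalar(...)`, alongside the metrics we already log.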

Metadata

Labels: optimization (Model training/inferencing optimization)

Status: 📋 Backlog