Adding win-x64 support for nvidia-smi #546

Open · wants to merge 3 commits into main

@@ -1,7 +1,7 @@
{
    "Description": "Default Monitors for Nvidia GPU systems.",
    "Metadata": {
-       "SupportedPlatforms": "linux-arm64,linux-x64",
+       "SupportedPlatforms": "linux-arm64,linux-x64,win-x64",
        "SupportedOperatingSystems": "CBL-Mariner,CentOS,Debian,RedHat,Suse,Ubuntu,Windows"
    },
    "Parameters": {
@@ -18,7 +18,7 @@ namespace VirtualClient.Monitors
    /// <summary>
    /// The Performance Counter Monitor for Virtual Client
    /// </summary>
-   [SupportedPlatforms("linux-arm64,linux-x64")]
+   [SupportedPlatforms("linux-arm64,linux-x64,win-x64")]
    public class NvidiaSmiMonitor : VirtualClientIntervalBasedMonitor
    {
        /// <summary>
@@ -39,34 +39,31 @@ protected override Task ExecuteAsync(EventContext telemetryContext, CancellationToken cancellationToken)
        {
            try
            {
-               if (this.Platform == PlatformID.Unix)
-               {
-                   // Check that nvidia-smi is installed. If not, we exit the monitor.
-                   bool toolsetInstalled = await this.VerifyToolsetInstalledAsync(telemetryContext, cancellationToken);
+               // Check that nvidia-smi is installed. If not, we exit the monitor.
+               bool toolsetInstalled = await this.VerifyToolsetInstalledAsync(telemetryContext, cancellationToken);
Contributor (inline review comment): Do they use exactly the same command line and output? Also, is there no need to append .exe on Windows? In some scripting environments you need .exe appended explicitly.
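On the `.exe` question above: process creation on Windows will usually resolve `nvidia-smi` to `nvidia-smi.exe` on its own, but if that ever turns out not to hold in a scripted environment, the name could be resolved per platform. The helper below is purely a hypothetical sketch for discussion, not code from this PR:

```csharp
using System;

// Hypothetical helper (not in this PR): pick the executable name per platform.
// Appending ".exe" explicitly avoids surprises in environments that do not
// apply PATHEXT-style resolution when launching processes.
public static class NvidiaSmiCommand
{
    public static string Resolve(PlatformID platform)
    {
        return platform == PlatformID.Win32NT ? "nvidia-smi.exe" : "nvidia-smi";
    }
}
```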


-
-                   if (toolsetInstalled)
-                   {
-                       await this.WaitAsync(this.MonitorWarmupPeriod, cancellationToken);
-
-                       int iterations = 0;
-                       while (!cancellationToken.IsCancellationRequested)
-                       {
-                           try
-                           {
-                               iterations++;
-                               if (this.IsIterationComplete(iterations))
-                               {
-                                   break;
-                               }
-
-                               await this.QueryC2CAsync(telemetryContext, cancellationToken);
-                               await this.QueryGpuAsync(telemetryContext, cancellationToken);
-                               await this.WaitAsync(this.MonitorFrequency, cancellationToken);
-                           }
-                           catch (Exception exc)
-                           {
-                               this.Logger.LogErrorMessage(exc, telemetryContext, LogLevel.Warning);
-                           }
-                       }
-                   }
-               }
+
+               if (toolsetInstalled)
+               {
+                   await this.WaitAsync(this.MonitorWarmupPeriod, cancellationToken);
+
+                   int iterations = 0;
+                   while (!cancellationToken.IsCancellationRequested)
+                   {
+                       try
+                       {
+                           iterations++;
+                           if (this.IsIterationComplete(iterations))
+                           {
+                               break;
+                           }
+
+                           await this.QueryC2CAsync(telemetryContext, cancellationToken);
+                           await this.QueryGpuAsync(telemetryContext, cancellationToken);
+                           await this.WaitAsync(this.MonitorFrequency, cancellationToken);
+                       }
+                       catch (Exception exc)
+                       {
+                           this.Logger.LogErrorMessage(exc, telemetryContext, LogLevel.Warning);
+                       }
+                   }
+               }
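With the `PlatformID.Unix` guard removed, whether the monitor runs on a given system now hinges entirely on `VerifyToolsetInstalledAsync`, whose implementation is not shown in this diff. Purely as an illustration of what such a probe might look like (using plain `System.Diagnostics.Process` rather than the Virtual Client process abstractions, and assuming nothing about the actual method):

```csharp
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: confirm nvidia-smi can be launched before starting the loop.
public static class NvidiaSmiProbe
{
    public static async Task<bool> IsInstalledAsync(CancellationToken cancellationToken)
    {
        try
        {
            using Process process = new Process();
            process.StartInfo.FileName = "nvidia-smi";   // CreateProcess finds nvidia-smi.exe on Windows
            process.StartInfo.RedirectStandardOutput = true;
            process.StartInfo.RedirectStandardError = true;
            process.StartInfo.UseShellExecute = false;

            process.Start();
            await process.WaitForExitAsync(cancellationToken);

            // A zero exit code means the utility ran and found at least one GPU.
            return process.ExitCode == 0;
        }
        catch (Win32Exception)
        {
            // The executable is not on the PATH.
            return false;
        }
    }
}
```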
@@ -140,7 +137,7 @@ private async Task QueryC2CAsync(EventContext telemetryContext, CancellationToken cancellationToken)
                this.Logger.LogErrorMessage(exc, telemetryContext, LogLevel.Warning);
            }
        }

        private async Task QueryGpuAsync(EventContext telemetryContext, CancellationToken cancellationToken)
        {
            // This is the Nvidia smi query gpu command
@@ -161,7 +158,7 @@ private async Task QueryGpuAsync(EventContext telemetryContext, CancellationToken cancellationToken)
"ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.sram," +
"ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram," +
"ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram," +
"ecc.errors.uncorrected.aggregate.total " +
"ecc.errors.uncorrected.aggregate.total " +
"--format=csv,nounits";

DateTime nextIteration = DateTime.UtcNow;
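The concatenation above ultimately produces a single `nvidia-smi --query-gpu=<field list> --format=csv,nounits` invocation, which prints a header row followed by one comma-separated row per GPU. As an illustrative sketch only (the monitor has its own result parsing, which is not part of this diff), output in that shape can be read like this; the field handling here is generic and assumes nothing about the real parser:

```csharp
using System;
using System.Linq;

// Illustrative only: read "csv,nounits" output from nvidia-smi --query-gpu.
public static class QueryGpuCsv
{
    public static void Print(string csvOutput)
    {
        string[] lines = csvOutput.Trim().Split('\n', StringSplitOptions.RemoveEmptyEntries);
        string[] headers = lines[0].Split(',').Select(value => value.Trim()).ToArray();

        // Each subsequent line describes one GPU, in the same column order as the header.
        foreach (string row in lines.Skip(1))
        {
            string[] values = row.Split(',').Select(value => value.Trim()).ToArray();
            for (int i = 0; i < headers.Length && i < values.Length; i++)
            {
                Console.WriteLine($"{headers[i]} = {values[i]}");
            }
        }
    }
}
```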
website/docs/monitors/0300-nvidia-smi.md (3 changes: 2 additions & 1 deletion)
@@ -1,7 +1,7 @@
# Nvidia SMI
The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

-This utility allows administrators to query GPU device state and with the appropriate privileges, permits administrators to modify GPU device state. It is targeted at the TeslaTM, GRIDTM, QuadroTM and Titan X product, though limited support is also available on other NVIDIA GPUs.
+This utility allows administrators to query GPU device state and with the appropriate privileges, permits administrators to modify GPU device state. It is targeted at the Blackwell, Hopper, Ampere, TeslaTM, GRIDTM, QuadroTM and Titan X product, though limited support is also available on other NVIDIA GPUs.

NVIDIA-smi ships with NVIDIA GPU display drivers on Linux, and with 64bit Windows Server 2008 R2 and Windows 7. Nvidia-smi can report query information as XML or human readable plain text to either standard output or a file. For more details, please refer to the nvidia-smi documentation.

@@ -14,6 +14,7 @@ This monitor has dependency on nvidia-smi. Please use [Nvidia Driver Installatio
## Supported Platforms
* linux-x64
* linux-arm64
+* win-x64

## Supported Query
Right now the query supported are --query-gpu and --query-c2c. Please create a feature request if you need other queries.