Issues with Hardware Metrics semantic conventions #940

@eero-t

What are you trying to achieve?

Follow the recommendations at: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

What did you expect to see?

A consistent and implementable specification.

Additional context.

I ran across the following issues when trying to map GPU metrics to these semantic conventions.

Non-implementable / unspecified items:

  • A single firmware_version won't work:
    • Devices have multiple types of firmware (display, media, scheduling/power management, etc.)
  • hw.errors has only the hw.error.type attribute, although:
    • Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost; at least a reset is needed to recover)
    • Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)
    • Errors can originate from different parts of the device HW (display, media, compute, 3D, etc.)
    • => I suggest adding a .category attribute, similar to the Level-Zero spec.
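
To make the hw.errors point concrete, here is a minimal sketch of how error counts could be keyed if the extra dimensions were available. Only hw.errors, hw.error.type, hw.type and hw.id come from the spec page; the hw.error.severity and hw.error.category keys, and all attribute values, are hypothetical names chosen for illustration, not anything the spec or Level-Zero defines under those names.

```python
from collections import Counter

# Hypothetical attribute keys (assumptions, not part of the current spec).
SEVERITY_KEY = "hw.error.severity"   # correctable / uncorrectable / fatal
CATEGORY_KEY = "hw.error.category"   # e.g. display / media / compute / driver


def error_attributes(hw_id, error_type, severity, category):
    """Build the attribute set for one hw.errors data point."""
    return {
        "hw.id": hw_id,
        "hw.type": "gpu",
        "hw.error.type": error_type,
        SEVERITY_KEY: severity,
        CATEGORY_KEY: category,
    }


# Cumulative error counts, one series per distinct attribute set.
hw_errors = Counter()


def record_error(attrs, count=1):
    """Add `count` errors to the series identified by `attrs`."""
    hw_errors[tuple(sorted(attrs.items()))] += count


# Illustrative values only.
record_error(error_attributes("gpu0", "memory", "correctable", "compute"), 3)
record_error(error_attributes("gpu0", "memory", "uncorrectable", "compute"))
record_error(error_attributes("gpu0", "memory", "correctable", "compute"))

for key, value in hw_errors.items():
    print(dict(key), value)
```

With only hw.error.type, the correctable and uncorrectable memory errors above would collapse into a single series.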

Inconsistencies:

  • hw.gpu.power vs. hw.power{hw.type="gpu"} confusion
    • If both are valid, why is there no hw.gpu.energy to match hw.energy{hw.type="gpu"}?
  • Common HW name & id attributes vs. GPU model & serial attributes
    • Should all of these be provided despite overlap?
  • Why is the vendor attribute used for GPU devices, but manufacturer for the (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/
  • Inconsistent naming examples to follow for GPU metrics that are missing from the spec:
    • system.cpu.frequency vs. hw.cpu.speed
    • .utilization vs. .*_ratio [1] suffix for things like (frequency) throttling
    • Should one provide, in addition to the base metric, a .utilization / .*_ratio metric, or a .limit value?

[1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics
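
The hw.gpu.power vs. hw.power{hw.type="gpu"} question can be illustrated with a small sketch. It assumes the two forms are meant to describe the same series, which is exactly what the spec leaves unclear; the normalize helper and the GENERIC_METRICS set are hypothetical and exist only for this example.

```python
# Generic metric suffixes assumed to have an hw.<suffix>{hw.type=...} form.
GENERIC_METRICS = {"power", "energy"}


def normalize(name, attrs):
    """Map a type-specific name such as hw.gpu.power onto the generic
    hw.power form with an hw.type attribute (an assumed equivalence)."""
    parts = name.split(".")
    if len(parts) == 3 and parts[0] == "hw" and parts[2] in GENERIC_METRICS:
        return f"hw.{parts[2]}", {**attrs, "hw.type": parts[1]}
    # Anything else is passed through unchanged.
    return name, dict(attrs)


print(normalize("hw.gpu.power", {"hw.id": "gpu0"}))
print(normalize("hw.gpu.energy", {"hw.id": "gpu0"}))
```

Under this assumption a consumer has to know about both spellings to find all GPU power data, and hw.gpu.energy would be the natural counterpart even though the spec does not list it.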

PS. Summing errors is not very meaningful (the rate is more interesting), but maybe an additional "all" category could be provided just to indicate whether there are any errors (within the query period) from the given HW? It could be useful both when more fine-grained categories are missing and in addition to them.
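
As a rough sketch of the two uses mentioned in the PS, the following derives an error rate and an "any errors in the period" flag from cumulative counter samples. The function names, the category names and the sample layout of (timestamp in seconds, cumulative count) pairs are assumptions for illustration, and counter resets are deliberately ignored.

```python
def error_rate(samples):
    """Errors per second between the first and last sample.

    `samples` is a list of (timestamp_s, cumulative_count) pairs.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (c1 - c0) / (t1 - t0)


def any_errors(samples_by_category):
    """True if any category's counter increased within the query period."""
    return any(s[-1][1] > s[0][1] for s in samples_by_category.values())


# Illustrative data for one GPU over a 60 second query period.
period = {
    "compute": [(0, 10), (60, 16)],
    "display": [(0, 2), (60, 2)],
}

print(error_rate(period["compute"]))  # errors per second for one category
print(any_errors(period))             # what an "all" category would signal
```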
