-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hardware-metrics to semantic conventions #2518
Add hardware-metrics to semantic conventions #2518
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bertysentry Thank you for the PR.
The linked issue has not been discussed yet. I think we likely want to have hardware metrics in the spec, but I would want to hear from other spec approvers if they agree that it is a good idea to have semantic conventions for the hardware metrics.
If the community feels these do belong to the spec then we will need a detailed review which may take some time due to the PR's size. Unless there is a reason that all of these must be merged together it may be easier to make progress if this is presented as a few separate PRs, one per hardware area (once there is a general agreement that we do want hardware metrics in the spec).
A few generic questions/comments:
- Please rebase from main and fix the build.
- Do we want
hw.
as a prefix and notsystem.
? The linked issue talks about adding e.g.system.sensor.temp
. Also, do we want abbreviatedhw.
orhardware.
? Any arguments in favour of against any of these approaches? - How does this proposal compare with any existing industry standards for hardware metrics reporting (if any exist)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major suggestions:
- Use standard units.
- Consider/Explore putting metadata as Resource instead of Attributes (most of the data won't change - unless the hardware changed).
- Reference to other standards (e.g. S.M.A.R.T.).
I've approved the PR because all of the above suggestions can be addressed in follow up PRs. The overall direction looks good to me.
Thank you for the time you spent on this review and your comments, @tigrannajaryan Let me answer your questions:
Having something separate from Example: at the hardware level, you don't consider CPU time utilization (as it's purely an OS metric). So If we mix Other examples are disks (the physical disk behind a RAID controller are different from the disks seen by the operating system) and network interfaces (the OS has loopback and plenty of virtual adapters for aggregation, VPNs, etc.).
Currently, we have
My company has been developing and maintaining a hardware monitoring software for 18 years, that covers all models from all manufacturers (Brocade, Cisco, Dell-EMC, HP, Huawei, HP, NetApp, Oracle, Pure, Synology, etc. I can't list them all). We've dealt with all sorts of hardware instrumentation standards, from specific SNMP MIBs to DMTF's WBEM, to IPMI, to SMI-S, to Redfish and SMASH, to proprietary REST APIs, you name it. See https://www.sentrysoftware.com/docs/hardware-connectors/23/platform-requirements.html for a full list of we got covered. We've rewritten our proprietary hardware monitoring software as an OpenTelemetry Collector distribution (available for free, not yet open source because this takes some time). We've taken some time to redefine the metrics that can be collected following OpenTelemetry semantic convention, and that's why we're proposing this. With our experience, we are pretty confident that the proposed metrics cover 95% of the cases and are good enough to represent any system available out there. You can trust us on this 😎 (although we're open to discuss, obviously, and that's what this PR is for). Note: The current version of our product doesn't strictly follow the proposed convention. We want to agree on something before we spend time updating our code and all dashboards that rely on these metrics. |
@reyang Thank you for your time and the review! Let me reply to a few comments:
Done!
You are right and I would happily add that to the convention, but I don't think we can have a Resource belonging to another Resource. In the current state of the semantic convention, all metrics are supposed to be attached to a host Resource (with the
Done for S.M.A.R.T. |
@tigrannajaryan Could you please re-execute the workflow? All checks should pass now. Thank you! |
Thank you @tigrannajaryan for running the checks. Unfortunately, markdown-link-check fails with this error:
But this check runs just fine in my environment and the offending link is valid! So I'm not sure what to do here. Re-run the checks, hoping for better luck? 🤷♂️ |
Network error maybe. I will re-run. |
@tigrannajaryan @pirgeo Thank you for your comments. I pushed an update to remove the non-standard unit (and moved it to the description). Please let me know if you are okay with this change. |
@tigrannajaryan @pirgeo Please let me know if you're okay with the latest changes so I can resolve the corresponding conversations. Also, who else should I ask for a review? Thank you! 😊 |
I think the main complication we have with this PR is the enum metrics which are modelled as Gauge. We have a few options to move forward:
I am personally leaning towards option 3. |
@tigrannajaryan Thank you for your comment (and again, sorry I haven't seen it earlier, I'm not getting any notification, it's weird). I vote for option 1. As you said, the document is still in experimental state, so breaking changes are expected (it regularly happens with the Otel SDKs). We can add a comment for all Enum-like metrics; "HIGHLY SUBJECT TO CHANGE". I personally commit to updating this semantic convention once the Enum metric #1711 issue is settled. |
@open-telemetry/specs-metrics-approvers we need you opinion on this PR, especially on the question about enums. |
I am not sure if there is a way of introducing an Enum data type without changing the data type here and breaking the semantic conventions (see my comment on #1711 (comment)). Given that, I don't think we can mark these semantic conventions as stable before the Enum data type is out. Therefore, I am also leaning towards 3, since these semantic conventions will have to stay experimental while we wait until we can break them. |
@pirgeo @tigrannajaryan I updated the hardware-metrics semantic convention to remove the offending Enum metrics, and replaced with proper Attributes. Let me know of any remaining issue! Thank you again for the discussion, comments and suggestions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this. I see your points for Enums, and agree that they would be a better fit here. Until they are available, however, I believe this is a valid solution that works with the current state of the specification and is in line with the other semantic conventions. Thank you!
@jack-berg PR updated with your suggestion. Please approve, thank you! |
@jack-berg Thank you for the approval! As a member of the open-telemetry/instr-wg group, do you have a way to approve as this group? |
Unfortunately I'm not a memeber! |
@jack-berg That's too bad! You're officially listed here though: https://github.com/open-telemetry/community/blob/main/community-members.md#specifications-and-proto |
@tigrannajaryan How can we fix this? @jack-berg approved the PR, but as he's not really a member of the instr-wg group, his approval doesn't unlock the merge of this PR. Thanks! |
@tigrannajaryan PR ready to be merged, thank you all for your help, reviews and suggestions! 😊 |
@open-telemetry/specs-approvers This PR formally has the required number of approvals. Since the PR is non-trivial let's going to keep it open for 2 more days to give time to object and if we don't hear anything let's merge. @bogdandrutu since you are now back please comment if you have any additional thoughts. |
2 days later... 😁 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some non-blocking comments asking for clarification below.
Sorry for being late to the party - feel free to address them in a follow-up PR instead.
@arminru All suggestions have been taken into account, and branch rebased on main. Please review the few items open for discussion, thank you! 😊 |
😭 <-- tears of joy |
Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>
Changes
Added hardware metrics to semantic conventions
Related to issue #2340