-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ksm] make time based metrics comparable. #11394
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm, but I think you need to delete the sections you aren't using ("upgrade" "feature" etc) from the release note template.
releasenotes/notes/ksm-comparable-metrics-b31f1f1fcbae1753.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @bjhaid ! The code change looks good.
The Datadog Cluster Agent can run the KSM Core check, could you also copy the release note file in the releasenotes-dca/notes
folder as well?
Thanks!
releasenotes/notes/ksm-comparable-metrics-b31f1f1fcbae1753.yaml
Outdated
Show resolved
Hide resolved
Prior to this patch the ksm collector would call `time.Now().Unix.()-metric.Val` at every point a `_time` metric is computed, this leads to the values collected in the same loop not being comparable since their values skews based on the value of `time.Now()` at the time metric was collected. For an example of the impact of the above, to compute how long it takes a pod to start after being created we create a function: ``` avg:kubernetes_state.pod.age{pod_name:filebeat-lf6hz} by {pod_name} - avg:kubernetes_state.pod.uptime{pod_name:filebeat-lf6hz} by {pod_name} ``` When this is plotted on a dashboard we end up with values fluctuating between negative and positive. To ensure the values are comparable the `currentTime` is snapshotted at the start of the Run loop and the value is reused in all places that had previously called `time.Now()`.
3be17db
to
2ebe0bf
Compare
@ahmed-mez do I need to do anything special for the cluster agent to pick up the change? As we actually run the cluster agent and are hoping to get this patch in the cluster-agent. |
@bjhaid the code is shared, so cluster agent v1.20 should contain this change when released. |
CI is green and CLA is signed
do you have an estimate on when the release schedule for v1.20 is? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have an estimate on when the release schedule for v1.20 is? Thanks!
We're actively working on releasing v1.19 right now - v1.20 is expected to be released in ~2 months.
Thanks for the contribution!
@kayayarai PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An edit suggestion to change the latinate abbreviation to English. Otherwise 👍 for docs.
releasenotes-dca/notes/ksm-comparable-metrics-b31f1f1fcbae1753.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/ksm-comparable-metrics-b31f1f1fcbae1753.yaml
Outdated
Show resolved
Hide resolved
….yaml Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
@ahmed-mez @kayayarai thanks a lot for your reviews, very much appreciated! |
@ahmed-mez do you have an ETA for v1.20? |
Hey @bjhaid Cluster Agent 1.20 was released last week |
What does this PR do?
Prior to this patch the ksm collector would call
time.Now().Unix.()-metric.Val
at every point a_time
metric iscomputed, this leads to the values collected in the same loop not being
comparable since their values skews based on the value of
time.Now()
at the time metric was collected.
For an example of the impact of the above, to compute how long it takes a
pod to start after being created we create a function:
When this is plotted on a dashboard we end up with values fluctuating
between negative and positive.
To ensure the values are comparable the
currentTime
is snapshotted atthe start of the Run loop and the value is reused in all places that had
previously called
time.Now()
.Motivation
We need to determine how long our pods take from creation to starting, to
measure control plane overhead. The screenshot below depicts the problem
this patch is attempting to solve:
Additional Notes
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Reviewer's Checklist
Triage
milestone is set.major_change
label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be use during QA to make sure each team pay extra attention to the changed behavior. For any customer facing change use a releasenote.changelog/no-changelog
label has been applied.qa/skip-qa
label is not applied.team/..
label has been applied, indicating the team(s) that should QA this change.need-change/operator
andneed-change/helm
labels have been applied.