Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ksm] make time based metrics comparable. #11394

Merged
merged 3 commits into from
Mar 25, 2022

Conversation

bjhaid
Copy link
Contributor

@bjhaid bjhaid commented Mar 21, 2022

What does this PR do?

Prior to this patch the ksm collector would call
time.Now().Unix.()-metric.Val at every point a _time metric is
computed, this leads to the values collected in the same loop not being
comparable since their values skews based on the value of time.Now()
at the time metric was collected.

For an example of the impact of the above, to compute how long it takes a
pod to start after being created we create a function:

avg:kubernetes_state.pod.age{pod_name:filebeat-lf6hz} by {pod_name}
- avg:kubernetes_state.pod.uptime{pod_name:filebeat-lf6hz} by {pod_name}

When this is plotted on a dashboard we end up with values fluctuating
between negative and positive.

To ensure the values are comparable the currentTime is snapshotted at
the start of the Run loop and the value is reused in all places that had
previously called time.Now().

Motivation

We need to determine how long our pods take from creation to starting, to
measure control plane overhead. The screenshot below depicts the problem
this patch is attempting to solve:

image

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Use the major_change label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be use during QA to make sure each team pay extra attention to the changed behavior. For any customer facing change use a releasenote.
  • A release note has been added or the changelog/no-changelog label has been applied.
  • Changed code has automated tests for its functionality.
  • Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • If applicable, the need-change/operator and need-change/helm labels have been applied.
  • If applicable, the config template has been updated.

@bjhaid bjhaid requested review from a team as code owners March 21, 2022 21:26
@bits-bot
Copy link
Collaborator

bits-bot commented Mar 21, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@kayayarai kayayarai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm, but I think you need to delete the sections you aren't using ("upgrade" "feature" etc) from the release note template.

@ahmed-mez ahmed-mez self-assigned this Mar 22, 2022
@ahmed-mez ahmed-mez added this to the DCA_1.20.0 milestone Mar 22, 2022
Copy link
Contributor

@ahmed-mez ahmed-mez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @bjhaid ! The code change looks good.

The Datadog Cluster Agent can run the KSM Core check, could you also copy the release note file in the releasenotes-dca/notes folder as well?

Thanks!

Prior to this patch the ksm collector would call
`time.Now().Unix.()-metric.Val` at every point a `_time` metric is
computed, this leads to the values collected in the same loop not being
comparable since their values skews based on the value of `time.Now()`
at the time metric was collected.

For an example of the impact of the above, to compute how long it takes a
pod to start after being created we create a function:

```
avg:kubernetes_state.pod.age{pod_name:filebeat-lf6hz} by {pod_name}
- avg:kubernetes_state.pod.uptime{pod_name:filebeat-lf6hz} by {pod_name}
```

When this is plotted on a dashboard we end up with values fluctuating
between negative and positive.

To ensure the values are comparable the `currentTime` is snapshotted at
the start of the Run loop and the value is reused in all places that had
previously called `time.Now()`.
@bjhaid bjhaid force-pushed the comparable_metrics branch from 3be17db to 2ebe0bf Compare March 22, 2022 15:08
@bjhaid bjhaid requested a review from a team as a code owner March 22, 2022 15:08
@bjhaid
Copy link
Contributor Author

bjhaid commented Mar 22, 2022

The Datadog Cluster Agent can run the KSM Core check

@ahmed-mez do I need to do anything special for the cluster agent to pick up the change? As we actually run the cluster agent and are hoping to get this patch in the cluster-agent.

@ahmed-mez
Copy link
Contributor

@bjhaid the code is shared, so cluster agent v1.20 should contain this change when released.
Happy to approve and merge the PR once the CI is green and the CLA is signed :)

@ahmed-mez ahmed-mez requested a review from kayayarai March 22, 2022 15:14
@bjhaid
Copy link
Contributor Author

bjhaid commented Mar 23, 2022

@ahmed-mez

Happy to approve and merge the PR once the CI is green and the CLA is signed :)

CI is green and CLA is signed

so cluster agent v1.20 should contain this change when released.

do you have an estimate on when the release schedule for v1.20 is? Thanks!

Copy link
Contributor

@ahmed-mez ahmed-mez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have an estimate on when the release schedule for v1.20 is? Thanks!

We're actively working on releasing v1.19 right now - v1.20 is expected to be released in ~2 months.

Thanks for the contribution!

@bjhaid
Copy link
Contributor Author

bjhaid commented Mar 24, 2022

@kayayarai PTAL, thanks!

Copy link
Contributor

@kayayarai kayayarai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An edit suggestion to change the latinate abbreviation to English. Otherwise 👍 for docs.

ahmed-mez and others added 2 commits March 25, 2022 09:27
….yaml

Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
@ahmed-mez ahmed-mez merged commit d3daed8 into DataDog:main Mar 25, 2022
@bjhaid bjhaid deleted the comparable_metrics branch March 25, 2022 14:14
@bjhaid
Copy link
Contributor Author

bjhaid commented Mar 25, 2022

@ahmed-mez @kayayarai thanks a lot for your reviews, very much appreciated!

@bjhaid
Copy link
Contributor Author

bjhaid commented Jun 29, 2022

@ahmed-mez do you have an ETA for v1.20?

@ahmed-mez
Copy link
Contributor

Hey @bjhaid Cluster Agent 1.20 was released last week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants