Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] Updating TM metrics to handle when capacity estimation returns NaN #207116

Merged
merged 9 commits into from
Feb 12, 2025

Conversation

doakalexi
Copy link
Contributor

@doakalexi doakalexi commented Jan 17, 2025

Resolves #204467

Summary

assumedRequiredThroughputPerMinutePerKibana is NaN when the capacityStats.runtime.value.load.p90 is undefined. This PR adds a check to catch when the load.p90 is undefined, throw an error, and ignore calculating the capacity estimation.

Checklist

To verify

I was not able to reproduce this locally without changing the code, so here is how I tested the code and I am definitely open to suggestions of how to better test this.

  1. Update the code to set capacityStats.runtime.value.load.p90: undefined. I set it here, but there are other places upstream where you could set it to undefined.
  2. Start Kibana
  3. Verify that you see the following log message:
 Task manager had an issue calculating capacity estimation. averageLoadPercentage: undefined

@doakalexi doakalexi changed the title Updating TM metrics to handle when capacity estimation returns NaN [ResponseOps] Updating TM metrics to handle when capacity estimation returns NaN Jan 22, 2025
@doakalexi doakalexi added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) backport:prev-minor Backport to (9.0) the previous minor version (i.e. one version back from main) v8.18.0 v9.0.0 release_note:skip Skip the PR/issue when compiling release notes labels Jan 22, 2025
@doakalexi doakalexi marked this pull request as ready for review January 22, 2025 17:51
@doakalexi doakalexi requested a review from a team as a code owner January 22, 2025 17:51
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi doakalexi requested review from ymao1 and JiaweiWu January 22, 2025 17:51
Copy link
Contributor

@JiaweiWu JiaweiWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ymao1
Copy link
Contributor

ymao1 commented Jan 28, 2025

How were you able to determine that capacityStats.runtime.value.load.p90 was occasionally undefined? I see a lot of values being passed to the percentageOf function which will return NaN if any are undefined.

@doakalexi
Copy link
Contributor Author

doakalexi commented Jan 28, 2025

How were you able to determine that capacityStats.runtime.value.load.p90 was occasionally undefined? I see a lot of values being passed to the percentageOf function which will return NaN if any are undefined.

When I was looking in the logs for assumedRequiredThroughputPerMinutePerKibana: NaN I noticed that there were only a few of the metrics that were also null under the capacity estimation. I looked at the common fields that all of those null fields were calculated off of and I was able to determine that it was bc of capacityStats.runtime.value.load.p90. In the logs it seems like every time assumedRequiredThroughputPerMinutePerKibana is NaN the runtime load values are all null.

@ymao1
Copy link
Contributor

ymao1 commented Jan 29, 2025

@doakalexi Gotcha. When the capacityStats.runtime.value.load.p90 was undefined, were the other values also undefined? (p50, p95, p99)?

@doakalexi
Copy link
Contributor Author

@doakalexi Gotcha. When the capacityStats.runtime.value.load.p90 was undefined, were the other values also undefined? (p50, p95, p99)?

Yes, but those values aren't used in the calculations

@ymao1
Copy link
Contributor

ymao1 commented Jan 29, 2025

Yea, I was just wondering if they were 0 or something. I'm wondering if we should fall back to 0 if the value is undefined instead of throwing an error but I'm not sure what impact that would have on the capacity estimation

@doakalexi
Copy link
Contributor Author

Yea, I was just wondering if they were 0 or something. I'm wondering if we should fall back to 0 if the value is undefined instead of throwing an error but I'm not sure what impact that would have on the capacity estimation

Sure we can do that, I considered defaulting to a value but I didn't know if that would cause confusion. I can try it out to see what happens to the calculations

@doakalexi
Copy link
Contributor Author

Yea, I was just wondering if they were 0 or something. I'm wondering if we should fall back to 0 if the value is undefined instead of throwing an error but I'm not sure what impact that would have on the capacity estimation

Sure we can do that, I considered defaulting to a value but I didn't know if that would cause confusion. I can try it out to see what happens to the calculations

I tested out what would happen if the value is 0 and it didn't seem to impact the capacity estimation. Should I go ahead and default to 0?

@doakalexi
Copy link
Contributor Author

After talking offline, we decided to keep the error instead of defaulting to 0.

@doakalexi doakalexi added backport:version Backport to applied version labels and removed backport:prev-minor Backport to (9.0) the previous minor version (i.e. one version back from main) labels Feb 6, 2025
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

@doakalexi doakalexi merged commit 8bff766 into elastic:main Feb 12, 2025
9 checks passed
@kibanamachine
Copy link
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/13292098469

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Feb 12, 2025
…returns NaN (elastic#207116)

Resolves elastic#204467

## Summary

`assumedRequiredThroughputPerMinutePerKibana` is `NaN` when the
`capacityStats.runtime.value.load.p90` is undefined. This PR adds a
check to catch when the load.p90 is undefined, throw an error, and
ignore calculating the capacity estimation.

### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### To verify
I was not able to reproduce this locally without changing the code, so
here is how I tested the code and I am definitely open to suggestions of
how to better test this.

1. Update the code to set `capacityStats.runtime.value.load.p90:
undefined`. I set it
[here](https://github.com/elastic/kibana/blob/286c9e2ddb9f338b0981cc5145bb4179ef7657cb/x-pack/platform/plugins/shared/task_manager/server/monitoring/capacity_estimation.ts#L55),
but there are other places upstream where you could set it to
`undefined`.
2. Start Kibana
3. Verify that you see the following log message:
```
 Task manager had an issue calculating capacity estimation. averageLoadPercentage: undefined
```

(cherry picked from commit 8bff766)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Feb 14, 2025
@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

kibanamachine added a commit that referenced this pull request Feb 14, 2025
…ation returns NaN (#207116) (#210914)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[ResponseOps] Updating TM metrics to handle when capacity estimation
returns NaN (#207116)](#207116)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Alexi
Doak","email":"109488926+doakalexi@users.noreply.github.com"},"sourceCommit":{"committedDate":"2025-02-12T18:16:35Z","message":"[ResponseOps]
Updating TM metrics to handle when capacity estimation returns NaN
(#207116)\n\nResolves
https://github.com/elastic/kibana/issues/204467\r\n\r\n##
Summary\r\n\r\n`assumedRequiredThroughputPerMinutePerKibana` is `NaN`
when the\r\n`capacityStats.runtime.value.load.p90` is undefined. This PR
adds a\r\ncheck to catch when the load.p90 is undefined, throw an error,
and\r\nignore calculating the capacity estimation.\r\n\r\n\r\n###
Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n\r\n### To
verify\r\nI was not able to reproduce this locally without changing the
code, so\r\nhere is how I tested the code and I am definitely open to
suggestions of\r\nhow to better test this.\r\n\r\n1. Update the code to
set `capacityStats.runtime.value.load.p90:\r\nundefined`. I set
it\r\n[here](https://github.com/elastic/kibana/blob/286c9e2ddb9f338b0981cc5145bb4179ef7657cb/x-pack/platform/plugins/shared/task_manager/server/monitoring/capacity_estimation.ts#L55),\r\nbut
there are other places upstream where you could set it
to\r\n`undefined`.\r\n2. Start Kibana\r\n3. Verify that you see the
following log message:\r\n```\r\n Task manager had an issue calculating
capacity estimation. averageLoadPercentage:
undefined\r\n```","sha":"8bff7660950a303600317bc3c73839c3d26c028a","branchLabelMapping":{"^v9.1.0$":"main","^v8.19.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:ResponseOps","backport:version","v9.1.0","v8.19.0"],"title":"[ResponseOps]
Updating TM metrics to handle when capacity estimation returns
NaN","number":207116,"url":"https://github.com/elastic/kibana/pull/207116","mergeCommit":{"message":"[ResponseOps]
Updating TM metrics to handle when capacity estimation returns NaN
(#207116)\n\nResolves
https://github.com/elastic/kibana/issues/204467\r\n\r\n##
Summary\r\n\r\n`assumedRequiredThroughputPerMinutePerKibana` is `NaN`
when the\r\n`capacityStats.runtime.value.load.p90` is undefined. This PR
adds a\r\ncheck to catch when the load.p90 is undefined, throw an error,
and\r\nignore calculating the capacity estimation.\r\n\r\n\r\n###
Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n\r\n### To
verify\r\nI was not able to reproduce this locally without changing the
code, so\r\nhere is how I tested the code and I am definitely open to
suggestions of\r\nhow to better test this.\r\n\r\n1. Update the code to
set `capacityStats.runtime.value.load.p90:\r\nundefined`. I set
it\r\n[here](https://github.com/elastic/kibana/blob/286c9e2ddb9f338b0981cc5145bb4179ef7657cb/x-pack/platform/plugins/shared/task_manager/server/monitoring/capacity_estimation.ts#L55),\r\nbut
there are other places upstream where you could set it
to\r\n`undefined`.\r\n2. Start Kibana\r\n3. Verify that you see the
following log message:\r\n```\r\n Task manager had an issue calculating
capacity estimation. averageLoadPercentage:
undefined\r\n```","sha":"8bff7660950a303600317bc3c73839c3d26c028a"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.1.0","branchLabelMappingKey":"^v9.1.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/207116","number":207116,"mergeCommit":{"message":"[ResponseOps]
Updating TM metrics to handle when capacity estimation returns NaN
(#207116)\n\nResolves
https://github.com/elastic/kibana/issues/204467\r\n\r\n##
Summary\r\n\r\n`assumedRequiredThroughputPerMinutePerKibana` is `NaN`
when the\r\n`capacityStats.runtime.value.load.p90` is undefined. This PR
adds a\r\ncheck to catch when the load.p90 is undefined, throw an error,
and\r\nignore calculating the capacity estimation.\r\n\r\n\r\n###
Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n\r\n### To
verify\r\nI was not able to reproduce this locally without changing the
code, so\r\nhere is how I tested the code and I am definitely open to
suggestions of\r\nhow to better test this.\r\n\r\n1. Update the code to
set `capacityStats.runtime.value.load.p90:\r\nundefined`. I set
it\r\n[here](https://github.com/elastic/kibana/blob/286c9e2ddb9f338b0981cc5145bb4179ef7657cb/x-pack/platform/plugins/shared/task_manager/server/monitoring/capacity_estimation.ts#L55),\r\nbut
there are other places upstream where you could set it
to\r\n`undefined`.\r\n2. Start Kibana\r\n3. Verify that you see the
following log message:\r\n```\r\n Task manager had an issue calculating
capacity estimation. averageLoadPercentage:
undefined\r\n```","sha":"8bff7660950a303600317bc3c73839c3d26c028a"}},{"branch":"8.x","label":"v8.19.0","branchLabelMappingKey":"^v8.19.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Alexi Doak <109488926+doakalexi@users.noreply.github.com>
@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:version Backport to applied version labels release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.19.0 v9.1.0
Projects
None yet
5 participants