-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. #193341
Conversation
Pinging @elastic/response-ops (Team:ResponseOps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InternalUser question below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the ES client and not the SO client to do partial updates to a saved object has several risks:
- saved objects are an abstraction to the raw ES doc, making it complex to transform into the correct shape ES expects.
- access to the internal ES indices is restricted i.t.o the request scope
- BWC: partial updates aren’t supported and cannot be guaranteed to work. The SO client takes care of BWC, up and down migrations and all the serialization/deserialization that happens in between
- The PR description hints to multiple namespaces. Updating a shared object means updating it’s instance in all spaces it is shared in and that requires auth checks.
- Core had to abandon partial updates in the SO client because of BWC.
- Core would also have to deal with any general migration issues because of docs that are corrupt from improper partial updates.
Thanks for underlying the risks, @TinaHeiligers. Our goal is to optimize for performance in this scenario because 1) we need to reduce the number of requests the alerting framework performs to Elasticsearch at scale and 2) we want to increase the task throughput per Kibana node by removing additional processing overhead (CPU bound operations, operations that add delays, etc) in favour of running more tasks. We've seen quite a bit of overhead necessary when using the saved-objects client, which made us investigate places where we may be able to work directly with the indices like we do in the Task Manager. We've seen great results when prototyping this and it felt the tradeoffs were becoming worthwhile. We've been able to scale down the number of ES and KB nodes when running at scale because of the reduced number of overall requests made. Otherwise at 10x scale, it becomes another 32,000 additional GET requests per minute that Elasticsearch needs to handle and for Kibana to process in this particular scenario. We don't foresee changing the list of fields often in which we perform partial updates to, if ever that becomes the case, perhaps we can use an intermediate release to do so? |
I'm not sure what this means. Maybe we'd need to add additional headers, or specifically use "asInternal" vs user API key to access (or vice versa)?
The implication from the description is to create several rules, in serveral spaces. Alerting rules are space-specific, they cannot be shared. AFAIK, there are no requests for us to make them shared. |
Perfect! Core also has no plans to allow namespace changes in the future, so this isn't a concern here. |
I'm not against using the ES client for performance gains, as long as the team is fully aware of the risks. Please add a risk matrix to the PR description! It's a way of documenting decisions and sharing knowledge with other teams that might end up referencing this PR as inspiration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; left a comment about maybe making the partially-updated attribute list explicit. No human will easily figure it out from the TS types :-). Not sure how feasible that is though ...
options: PartiallyUpdateRuleSavedObjectOptions = {} | ||
): Promise<void> { | ||
// ensure we only have the valid attributes that are not encrypted and are excluded from AAD | ||
const attributeUpdates = omit(attributes, [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems slighty dangerous, in case we forget to include one of the attribues in the lists used below.
Feels like it might be better to have an explicit list of attributes that we ALLOW to be partially updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree as well. Disclaimer - typically I would categorize something like this as "strongly not recommended", but with intimate knowledge of an ESO type and careful maintenance, this is possible to do safely.
Filtering by an explicit safe list of just the attributes that we expect to modify in this way, and also omitting any encrypted/AAD attributes (just in case), seems most prudent. Even if it requires a little bit more maintenance in the long run, this has potential to go quite poorly, so we should do what we can to prevent that possibility.
cc @azasypkin to get his 2 cents as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ymao1 Can we wait to merge this until Oleg has a chance to take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeramysoucy Yes absolutely! I'll make the changes for the AAD attributes and will wait to merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added allowlist in this commit: 4683078
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @azasypkin and he is 👍 with an additional comment about why we're bypassing audit logging (added in 409a2fa) and a followup issue for the Core team to investigate how they might expose a more performance SO client (#194435)
@TinaHeiligers I've added a risk matrix to the PR description. Thanks for flagging! |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly declaring a list of attributes that can be updated mitigates some of the risks involved in using the ES client directly.
LGTM
options: PartiallyUpdateRuleSavedObjectOptions = {} | ||
): Promise<void> { | ||
// ensure we only have the valid attributes that are not encrypted and are excluded from AAD | ||
const attributeUpdates = omit(attributes, [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested locally. Looks good to me!
@elasticmachine merge upstream |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: cc @ymao1 |
Starting backport for target branches: 8.x |
…le run instead of SO client. (elastic#193341) Resolves elastic#192397 ## Summary Updates alerting task runner end of run updates to use the ES client update function for a true partial update instead of the saved objects client update function that performs a GET then an update. ## To verify Create a rule in multiple spaces and ensure they run correctly and their execution status and monitoring history are updated at the end of each run. Because we're performing a partial update on attributes that are not in the AAD, the rule should continue running without any encryption errors. ## Risk Matrix | Risk | Probability | Severity | Mitigation/Notes | |---------------------------|-------------|----------|-------------------------| | Updating saved object directly using ES client will break BWC | Medium | High | Response Ops follows an intermediate release strategy for any changes to the rule saved object where schema changes are introduced in an intermediate release before any changes to the saved object are actually made in a followup release. This ensures that any rollbacks that may be required in a release will roll back to a version that is already aware of the new schema. The team is socialized to this strategy as we are requiring users of the alerting framework to also follow this strategy. This should address any backward compatibility issues that might arise by circumventing the saved objects client update function. | | Updating saved object directly using ES client will break AAD | Medium | High | An explicit allowlist of non-AAD fields that are allowed to be partially updated has been introduced and any fields not in this allowlist will not be included in the partial update. Any updates to the rule saved object that might break AAD would show up with > 1 execution of a rule and we have a plethora of functional tests that rely on multiple executions of a rule that would flag if there were issues running due to AAD issues. | --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> (cherry picked from commit 05926c2)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation |
… of rule run instead of SO client. (#193341) (#194444) # Backport This will backport the following commits from `main` to `8.x`: - [[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341)](#193341) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Ying Mao","email":"ying.mao@elastic.co"},"sourceCommit":{"committedDate":"2024-09-30T14:40:02Z","message":"[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341)\n\nResolves https://github.com/elastic/kibana/issues/192397\r\n\r\n## Summary\r\n\r\nUpdates alerting task runner end of run updates to use the ES client\r\nupdate function for a true partial update instead of the saved objects\r\nclient update function that performs a GET then an update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and ensure they run correctly and their\r\nexecution status and monitoring history are updated at the end of each\r\nrun. Because we're performing a partial update on attributes that are\r\nnot in the AAD, the rule should continue running without any encryption\r\nerrors.\r\n\r\n## Risk Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes |\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n| Updating saved object directly using ES client will break BWC | Medium\r\n| High | Response Ops follows an intermediate release strategy for any\r\nchanges to the rule saved object where schema changes are introduced in\r\nan intermediate release before any changes to the saved object are\r\nactually made in a followup release. This ensures that any rollbacks\r\nthat may be required in a release will roll back to a version that is\r\nalready aware of the new schema. The team is socialized to this strategy\r\nas we are requiring users of the alerting framework to also follow this\r\nstrategy. This should address any backward compatibility issues that\r\nmight arise by circumventing the saved objects client update function. |\r\n| Updating saved object directly using ES client will break AAD | Medium\r\n| High | An explicit allowlist of non-AAD fields that are allowed to be\r\npartially updated has been introduced and any fields not in this\r\nallowlist will not be included in the partial update. Any updates to the\r\nrule saved object that might break AAD would show up with > 1 execution\r\nof a rule and we have a plethora of functional tests that rely on\r\nmultiple executions of a rule that would flag if there were issues\r\nrunning due to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Feature:Alerting","release_note:skip","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client.","number":193341,"url":"https://github.com/elastic/kibana/pull/193341","mergeCommit":{"message":"[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341)\n\nResolves https://github.com/elastic/kibana/issues/192397\r\n\r\n## Summary\r\n\r\nUpdates alerting task runner end of run updates to use the ES client\r\nupdate function for a true partial update instead of the saved objects\r\nclient update function that performs a GET then an update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and ensure they run correctly and their\r\nexecution status and monitoring history are updated at the end of each\r\nrun. Because we're performing a partial update on attributes that are\r\nnot in the AAD, the rule should continue running without any encryption\r\nerrors.\r\n\r\n## Risk Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes |\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n| Updating saved object directly using ES client will break BWC | Medium\r\n| High | Response Ops follows an intermediate release strategy for any\r\nchanges to the rule saved object where schema changes are introduced in\r\nan intermediate release before any changes to the saved object are\r\nactually made in a followup release. This ensures that any rollbacks\r\nthat may be required in a release will roll back to a version that is\r\nalready aware of the new schema. The team is socialized to this strategy\r\nas we are requiring users of the alerting framework to also follow this\r\nstrategy. This should address any backward compatibility issues that\r\nmight arise by circumventing the saved objects client update function. |\r\n| Updating saved object directly using ES client will break AAD | Medium\r\n| High | An explicit allowlist of non-AAD fields that are allowed to be\r\npartially updated has been introduced and any fields not in this\r\nallowlist will not be included in the partial update. Any updates to the\r\nrule saved object that might break AAD would show up with > 1 execution\r\nof a rule and we have a plethora of functional tests that rely on\r\nmultiple executions of a rule that would flag if there were issues\r\nrunning due to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/193341","number":193341,"mergeCommit":{"message":"[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341)\n\nResolves https://github.com/elastic/kibana/issues/192397\r\n\r\n## Summary\r\n\r\nUpdates alerting task runner end of run updates to use the ES client\r\nupdate function for a true partial update instead of the saved objects\r\nclient update function that performs a GET then an update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and ensure they run correctly and their\r\nexecution status and monitoring history are updated at the end of each\r\nrun. Because we're performing a partial update on attributes that are\r\nnot in the AAD, the rule should continue running without any encryption\r\nerrors.\r\n\r\n## Risk Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes |\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n| Updating saved object directly using ES client will break BWC | Medium\r\n| High | Response Ops follows an intermediate release strategy for any\r\nchanges to the rule saved object where schema changes are introduced in\r\nan intermediate release before any changes to the saved object are\r\nactually made in a followup release. This ensures that any rollbacks\r\nthat may be required in a release will roll back to a version that is\r\nalready aware of the new schema. The team is socialized to this strategy\r\nas we are requiring users of the alerting framework to also follow this\r\nstrategy. This should address any backward compatibility issues that\r\nmight arise by circumventing the saved objects client update function. |\r\n| Updating saved object directly using ES client will break AAD | Medium\r\n| High | An explicit allowlist of non-AAD fields that are allowed to be\r\npartially updated has been introduced and any fields not in this\r\nallowlist will not be included in the partial update. Any updates to the\r\nrule saved object that might break AAD would show up with > 1 execution\r\nof a rule and we have a plethora of functional tests that rely on\r\nmultiple executions of a rule that would flag if there were issues\r\nrunning due to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}] BACKPORT--> Co-authored-by: Ying Mao <ying.mao@elastic.co>
Resolves #192397
Summary
Updates alerting task runner end of run updates to use the ES client update function for a true partial update instead of the saved objects client update function that performs a GET then an update.
To verify
Create a rule in multiple spaces and ensure they run correctly and their execution status and monitoring history are updated at the end of each run. Because we're performing a partial update on attributes that are not in the AAD, the rule should continue running without any encryption errors.
Risk Matrix