[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. #193341

ymao1 · 2024-09-18T17:35:06Z

Resolves #192397

Summary

Updates the alerting task runner's end-of-run updates to use the ES client update function for a true partial update, instead of the saved objects client update function, which performs a GET and then an update.
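To make the request-count difference concrete, here is a minimal sketch. It uses a hypothetical in-memory `FakeIndex` rather than the real saved objects client or Elasticsearch client, so `FakeIndex`, `readModifyWrite`, and `partialUpdate` are illustrative names only and not Kibana APIs.

```typescript
// Hypothetical in-memory stand-in for an index; it only counts requests.
type Doc = Record<string, unknown>;

class FakeIndex {
  public requests = 0;
  private docs = new Map<string, Doc>();

  // Test helper: store a document without counting a request.
  seed(id: string, doc: Doc): void {
    this.docs.set(id, { ...doc });
  }

  // Test helper: read a document without counting a request.
  peek(id: string): Doc | undefined {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // One request: fetch the full document.
  get(id: string): Doc {
    this.requests++;
    const doc = this.docs.get(id);
    if (!doc) throw new Error(`not found: ${id}`);
    return { ...doc };
  }

  // One request: overwrite the full document.
  index(id: string, doc: Doc): void {
    this.requests++;
    this.docs.set(id, { ...doc });
  }

  // One request: merge only the supplied top-level fields (shallow merge in this sketch).
  update(id: string, partial: Doc): void {
    this.requests++;
    const doc = this.docs.get(id);
    if (!doc) throw new Error(`not found: ${id}`);
    this.docs.set(id, { ...doc, ...partial });
  }
}

// GET-then-update pattern: two requests per end-of-run update.
function readModifyWrite(index: FakeIndex, id: string, changes: Doc): void {
  const current = index.get(id);
  index.index(id, { ...current, ...changes });
}

// True partial update: one request per end-of-run update.
function partialUpdate(index: FakeIndex, id: string, changes: Doc): void {
  index.update(id, changes);
}
```

In this sketch the first pattern costs two requests per rule run and the second costs one, which is the saving the summary describes.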

To verify

Create a rule in multiple spaces and ensure they run correctly and their execution status and monitoring history are updated at the end of each run. Because we're performing a partial update on attributes that are not in the AAD, the rule should continue running without any encryption errors.

Risk Matrix

Risk	Probability	Severity	Mitigation/Notes
Updating saved object directly using ES client will break BWC	Medium	High	Response Ops follows an intermediate release strategy for any changes to the rule saved object where schema changes are introduced in an intermediate release before any changes to the saved object are actually made in a followup release. This ensures that any rollbacks that may be required in a release will roll back to a version that is already aware of the new schema. The team is socialized to this strategy as we are requiring users of the alerting framework to also follow this strategy. This should address any backward compatibility issues that might arise by circumventing the saved objects client update function.
Updating saved object directly using ES client will break AAD	Medium	High	An explicit allowlist of non-AAD fields that are allowed to be partially updated has been introduced and any fields not in this allowlist will not be included in the partial update. Any updates to the rule saved object that might break AAD would show up with > 1 execution of a rule and we have a plethora of functional tests that rely on multiple executions of a rule that would flag if there were issues running due to AAD issues.

elasticmachine · 2024-09-18T19:36:55Z

Pinging @elastic/response-ops (Team:ResponseOps)

kc13greiner

InternalUser question below

x-pack/plugins/alerting/server/task_runner/task_runner.ts

TinaHeiligers

Using the ES client and not the SO client to do partial updates to a saved object has several risks:

saved objects are an abstraction over the raw ES doc, making it complex to transform into the correct shape ES expects.
access to the internal ES indices is restricted in terms of the request scope
BWC: partial updates aren’t supported and cannot be guaranteed to work. The SO client takes care of BWC, up and down migrations and all the serialization/deserialization that happens in between
The PR description hints at multiple namespaces. Updating a shared object means updating its instance in all spaces it is shared in, and that requires auth checks.
Core had to abandon partial updates in the SO client because of BWC.
Core would also have to deal with any general migration issues because of docs that are corrupt from improper partial updates.
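As a rough illustration of the first point, the sketch below shows the kind of translation a caller has to do by hand when bypassing the SO client. It assumes a single-namespace saved object type whose raw document ID is prefixed with the space and type, and whose attributes are nested under a key named after the type. The helper names `toRawId` and `toPartialUpdateBody` are hypothetical, and the exact raw shape should be checked against the saved objects serializer rather than taken from this sketch.

```typescript
// Hypothetical helper: build the raw document ID for a single-namespace type.
// Assumption: objects in the default space carry no namespace prefix.
function toRawId(type: string, id: string, namespace?: string): string {
  const prefix = namespace && namespace !== 'default' ? `${namespace}:` : '';
  return `${prefix}${type}:${id}`;
}

// Hypothetical helper: wrap saved object attributes in the shape a partial
// update would need, with the attributes nested under the type name.
function toPartialUpdateBody(
  type: string,
  attributes: Record<string, unknown>
): { doc: Record<string, unknown> } {
  return { doc: { [type]: attributes } };
}
```

For example, `toRawId('alert', 'abc', 'space-1')` yields `space-1:alert:abc` in this sketch. None of the migration or serialization guarantees of the SO client apply to a body built this way, which is the risk being described.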

mikecote · 2024-09-19T17:59:27Z

Thanks for outlining the risks, @TinaHeiligers. Our goal is to optimize for performance in this scenario because 1) we need to reduce the number of requests the alerting framework performs to Elasticsearch at scale and 2) we want to increase the task throughput per Kibana node by removing additional processing overhead (CPU bound operations, operations that add delays, etc) in favour of running more tasks. We've seen quite a bit of overhead necessary when using the saved-objects client, which made us investigate places where we may be able to work directly with the indices like we do in the Task Manager.

We've seen great results when prototyping this and it felt the tradeoffs were becoming worthwhile. We've been able to scale down the number of ES and KB nodes when running at scale because of the reduced number of overall requests made. Otherwise at 10x scale, it becomes another 32,000 additional GET requests per minute that Elasticsearch needs to handle and for Kibana to process in this particular scenario.

We don't foresee often changing the list of fields on which we perform partial updates. If that ever becomes necessary, perhaps we can use an intermediate release to do so?

pmuellr · 2024-09-19T18:29:09Z

access to the internal ES indices is restricted in terms of the request scope

I'm not sure what this means. Maybe we'd need to add additional headers, or specifically use "asInternal" vs user API key to access (or vice versa)?

The PR description hints at multiple namespaces. Updating a shared object means updating its instance in all spaces it is shared in, and that requires auth checks.

The implication from the description is to create several rules, in several spaces. Alerting rules are space-specific, they cannot be shared. AFAIK, there are no requests for us to make them shared.

TinaHeiligers · 2024-09-19T18:31:01Z

The implication from the description is to create several rules, in several spaces. Alerting rules are space-specific, they cannot be shared. AFAIK, there are no requests for us to make them shared.

Perfect! Core also has no plans to allow namespace changes in the future, so this isn't a concern here.

TinaHeiligers · 2024-09-19T18:48:36Z

I'm not against using the ES client for performance gains, as long as the team is fully aware of the risks. Please add a risk matrix to the PR description! It's a way of documenting decisions and sharing knowledge with other teams that might end up referencing this PR as inspiration.
Please make sure there's good code coverage and validation to catch edge cases!

pmuellr

LGTM; left a comment about maybe making the partially-updated attribute list explicit. No human will easily figure it out from the TS types :-). Not sure how feasible that is though ...

pmuellr · 2024-09-19T18:24:55Z

x-pack/plugins/alerting/server/saved_objects/partially_update_rule.ts

+  options: PartiallyUpdateRuleSavedObjectOptions = {}
+): Promise<void> {
+  // ensure we only have the valid attributes that are not encrypted and are excluded from AAD
+  const attributeUpdates = omit(attributes, [


This seems slightly dangerous, in case we forget to include one of the attributes in the lists used below.

Feels like it might be better to have an explicit list of attributes that we ALLOW to be partially updated.

I agree as well. Disclaimer - typically I would categorize something like this as "strongly not recommended", but with intimate knowledge of an ESO type and careful maintenance, this is possible to do safely.

Filtering by an explicit safe list of just the attributes that we expect to modify in this way, and also omitting any encrypted/AAD attributes (just in case), seems most prudent. Even if it requires a little bit more maintenance in the long run, this has potential to go quite poorly, so we should do what we can to prevent that possibility.

cc @azasypkin to get his 2 cents as well

@ymao1 Can we wait to merge this until Oleg has a chance to take a look?

@jeramysoucy Yes absolutely! I'll make the changes for the AAD attributes and will wait to merge

Added allowlist in this commit: 4683078
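For readers who have not opened that commit, the general shape of an allowlist filter with a second "just in case" guard might look like the sketch below. This is not the code from the commit. The attribute names in both lists and the function name `pickAllowedAttributes` are assumptions chosen for illustration.

```typescript
// Hypothetical allowlist: only these attributes may be partially updated.
const PARTIAL_UPDATE_ALLOWLIST: readonly string[] = [
  'executionStatus',
  'monitoring',
  'lastRun',
  'nextRun',
  'running',
];

// Hypothetical guard list: encrypted or AAD-included attributes that must
// never be written this way, even if one is added to the allowlist by mistake.
const NEVER_PARTIALLY_UPDATE = new Set<string>(['apiKey', 'params']);

// Keep only attributes that are on the allowlist and not on the guard list.
function pickAllowedAttributes(
  attributes: Record<string, unknown>
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const key of PARTIAL_UPDATE_ALLOWLIST) {
    if (NEVER_PARTIALLY_UPDATE.has(key)) continue;
    if (Object.prototype.hasOwnProperty.call(attributes, key)) {
      result[key] = attributes[key];
    }
  }
  return result;
}
```

With an allowlist, an attribute that nobody thought about is dropped by default, whereas with an `omit`-style denylist it would be written by default. That is the safety difference raised in this thread.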

Discussed with @azasypkin and he is 👍 with an additional comment about why we're bypassing audit logging (added in 409a2fa) and a followup issue for the Core team to investigate how they might expose a more performant SO client (#194435)

ymao1 · 2024-09-20T15:34:41Z

Please add a risk matrix to the PR description! It's a way of documenting decisions and sharing knowledge with other teams that might end up referencing this PR as inspiration.

@TinaHeiligers I've added a risk matrix to the PR description. Thanks for flagging!

ymao1 · 2024-09-23T12:02:56Z

@elasticmachine merge upstream

TinaHeiligers

Explicitly declaring a list of attributes that can be updated mitigates some of the risks involved in using the ES client directly.
LGTM

TinaHeiligers · 2024-09-23T16:45:48Z

x-pack/plugins/alerting/server/saved_objects/partially_update_rule.ts

+  options: PartiallyUpdateRuleSavedObjectOptions = {}
+): Promise<void> {
+  // ensure we only have the valid attributes that are not encrypted and are excluded from AAD
+  const attributeUpdates = omit(attributes, [


guskovaue

Tested locally. Looks good to me!

ymao1 · 2024-09-30T11:43:26Z

@elasticmachine merge upstream

kibana-ci · 2024-09-30T14:28:05Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 409a2fa

Failed CI Steps

FTR Configs #85

Test Failures

[job] [logs] FTR Configs #85 / custom visualizations self changing vis "before all" hook for "should allow updating params via the editor"

Metrics [docs]

✅ unchanged

History

💛 Build #237740 was flaky 24e5726
💛 Build #236331 was flaky ddb40c9
💛 Build #236027 was flaky f3946ff
💚 Build #235348 succeeded 0e92d58
💔 Build #235340 failed d6c5925

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

kibanamachine · 2024-09-30T14:40:23Z

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11108440052

…le run instead of SO client. (elastic#193341)

Resolves elastic#192397

## Summary

Updates alerting task runner end of run updates to use the ES client update function for a true partial update instead of the saved objects client update function that performs a GET then an update.

## To verify

Create a rule in multiple spaces and ensure they run correctly and their execution status and monitoring history are updated at the end of each run. Because we're performing a partial update on attributes that are not in the AAD, the rule should continue running without any encryption errors.

## Risk Matrix

| Risk | Probability | Severity | Mitigation/Notes |
|---------------------------|-------------|----------|-------------------------|
| Updating saved object directly using ES client will break BWC | Medium | High | Response Ops follows an intermediate release strategy for any changes to the rule saved object where schema changes are introduced in an intermediate release before any changes to the saved object are actually made in a followup release. This ensures that any rollbacks that may be required in a release will roll back to a version that is already aware of the new schema. The team is socialized to this strategy as we are requiring users of the alerting framework to also follow this strategy. This should address any backward compatibility issues that might arise by circumventing the saved objects client update function. |
| Updating saved object directly using ES client will break AAD | Medium | High | An explicit allowlist of non-AAD fields that are allowed to be partially updated has been introduced and any fields not in this allowlist will not be included in the partial update. Any updates to the rule saved object that might break AAD would show up with > 1 execution of a rule and we have a plethora of functional tests that rely on multiple executions of a rule that would flag if there were issues running due to AAD issues. |

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit 05926c2)

kibanamachine · 2024-09-30T14:44:51Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

… of rule run instead of SO client. (#193341) (#194444)

# Backport

This will backport the following commits from `main` to `8.x`:

- [[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341)](#193341)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ying Mao <ying.mao@elastic.co>

ymao1 added 2 commits September 18, 2024 13:34

Using es partial update at end of rule run

c234082

Cleanup

d6c5925

ymao1 self-assigned this Sep 18, 2024

[CI] Auto-commit changed files from 'node scripts/notice'

0e92d58

ymao1 marked this pull request as ready for review September 18, 2024 19:36

ymao1 requested review from a team as code owners September 18, 2024 19:36

ymao1 requested review from pmuellr and guskovaue September 18, 2024 19:37

kc13greiner self-requested a review September 18, 2024 20:21

kc13greiner reviewed Sep 18, 2024

x-pack/plugins/alerting/server/task_runner/task_runner.ts

TinaHeiligers reviewed Sep 19, 2024

pmuellr approved these changes Sep 19, 2024

ymao1 added 3 commits September 20, 2024 09:28

Merge branch 'main' of github.com:elastic/kibana into alerting/es-par…

9429448

…tial-update

Adding allowlist for attributes we can partially update with ES

4683078

Merge branch 'main' of github.com:elastic/kibana into alerting/es-par…

f3946ff

…tial-update

Merge branch 'main' into alerting/es-partial-update

ddb40c9

TinaHeiligers approved these changes Sep 23, 2024

guskovaue approved these changes Sep 25, 2024

Merging in main

24e5726

elasticmachine and others added 2 commits September 30, 2024 13:43

Merge branch 'main' into alerting/es-partial-update

7221c28

Adding comment about bypassing audit logging

409a2fa

ymao1 mentioned this pull request Sep 30, 2024

Investigate more performant SavedObjects client/repository #194435

Open

ymao1 merged commit 05926c2 into elastic:main Sep 30, 2024
41 checks passed

ymao1 deleted the alerting/es-partial-update branch September 30, 2024 14:40

kibanamachine mentioned this pull request Sep 30, 2024

[8.x] [Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. (#193341) #194444

Merged
