
Conversation

@pmuellr
Member

@pmuellr pmuellr commented Oct 18, 2020

resolves #55634
resolves #65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.
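
For a sense of the shape of this, a minimal sketch (illustrative only; the Doc shape and bulkIndex helper are stand-ins, while EVENT_BUFFER_TIME / EVENT_BUFFER_LENGTH mirror the constants used in the PR):

import { Subject } from 'rxjs';
import { bufferTime, filter, switchMap } from 'rxjs/operators';

const EVENT_BUFFER_TIME = 1000; // flush at least once per second
const EVENT_BUFFER_LENGTH = 100; // or as soon as 100 docs are buffered

interface Doc {
  index: string;
  body: Record<string, unknown>;
}

// stand-in for the real bulk call made via the ES client
async function bulkIndex(docs: Doc[]): Promise<void> {
  // esClient.bulk({ ... }) in the real code
}

const docBuffer$ = new Subject<Doc>();

// buffer by time *or* count, drop empty buffers, and bulk-index each batch
const docsBuffered$ = docBuffer$.pipe(
  bufferTime(EVENT_BUFFER_TIME, null, EVENT_BUFFER_LENGTH),
  filter((docs) => docs.length > 0),
  switchMap(async (docs) => await bulkIndex(docs))
);
docsBuffered$.subscribe();

// event writers just push docs onto the subject
docBuffer$.next({ index: 'some-event-log-index', body: { message: 'hello' } });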

@pmuellr
Member Author

pmuellr commented Oct 20, 2020

I'm considering this a research spike, but it seems to be working the way we want. I might not be doing my Observables quite right :-).

One of the issues that came up with this is trying to get all the buffered docs written to ES in the case of an "orderly" shutdown of Kibana. I've made the plugin's stop() method async and have it wait for the observables to complete.

The current code (before this PR) was likely dropping the existing "queued" documents at shutdown anyway, as they were "queued up" as an unbounded number of setImmediate()s, each writing a single doc to ES, so we didn't really have any control over them.
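
For contrast, a sketch of the old per-document path described above (not the actual pre-PR code; indexSingleDocument is a hypothetical stand-in):

interface Doc {
  index: string;
  body: Record<string, unknown>;
}

// hypothetical stand-in for the old single-document write
declare function indexSingleDocument(doc: Doc): Promise<void>;

// each logged event forked an independent, un-awaited index call via setImmediate(),
// so at shutdown an unbounded number of these could still be pending
function logEventOld(doc: Doc): void {
  setImmediate(async () => {
    try {
      await indexSingleDocument(doc);
    } catch (err) {
      // errors could only be logged; nothing waited on these at shutdown
    }
  });
}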

@mikecote
Contributor

I think this would be a good optimization to get in, especially as we're talking about scaling alerting and looking for efficiencies / improvements.

I've made the plugin's stop() method async and have it wait for the observables to complete.

One note here: I'm not sure if the platform will wait for the stop promise. If it does, they may be removing support for that soon (#74395).

@pmuellr
Member Author

pmuellr commented Oct 28, 2020

One note here: I'm not sure if the platform will wait for the stop promise. If it does, they may be removing support for that soon (#74395).

I had a Slack discussion with @pgayvallet about this a week ago, as I had the same concern. The platform is currently doing an await on the invocation, but there are some signature differences between the plugin lifecycle interfaces and the implementation, so it's not obvious when you start looking at it. They will need to introduce some kind of timeout, like they do for setup/start today (there's a reference to some shutdown changes here). Other than that, I got a 👍 on the approach in this PR.
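
For what it's worth, the kind of timeout guard I'd expect is roughly this (illustrative only, not the platform's actual code; stopWithTimeout is a hypothetical helper):

// race a plugin's stop() promise against a timeout so a slow shutdown
// can't hang the process indefinitely
async function stopWithTimeout(stop: () => Promise<void>, timeoutMs = 30_000): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, timeoutMs);
  });
  try {
    await Promise.race([stop(), timeout]);
  } finally {
    if (timer) clearTimeout(timer);
  }
}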

@mikecote
Contributor

Awesome! Let me know when it's ready for review, I'll be happy to go over the changes 👍

@pmuellr pmuellr force-pushed the event-log/rx-buffer branch 2 times, most recently from 93536e0 to 4ede838 Compare November 16, 2020 14:14
resolves elastic#55634

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.
@pmuellr pmuellr force-pushed the event-log/rx-buffer branch from 4ede838 to 1a7134d Compare November 16, 2020 14:20
@pmuellr pmuellr added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.11.0 v8.0.0 labels Nov 17, 2020
@pmuellr pmuellr marked this pull request as ready for review November 17, 2020 03:41
@pmuellr pmuellr requested a review from a team as a code owner November 17, 2020 03:41
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris gmmorris self-requested a review November 17, 2020 11:59
Contributor

@gmmorris gmmorris left a comment

I'd recommend cleaning up some of the RxJS lifecycle handling (notes added), but other than that it seems to work as expected.

I don't think there's a way to actually test this locally other than checking "by eye" that it uses bulk.. so LGTM 🤷

async indexDocuments(docs: Doc[]): Promise<void> {
  // if es initialization failed, don't try to index
  if (!(await this.context.waitTillReady())) {
    return;
Contributor

Should this throw an error instead of just returning silently? 🤔
Or at least log it?

Member Author

Originally, it did log. However, I realized that it would log anytime the messages got flushed, and the entire pipeline will still run even if the initialization failed (indicated by waitTillReady() resolving to false). The entire pipeline still runs because the code path for logging an event from a client is now completely synchronous, replacing the old setImmediate() forking of the log writing with the new next() on the observable - this is fabulous! But it also means we can't check waitTillReady() until we get to this point. And we don't want to be spamming an error message here :-)

So, instead, I put one of these waitTillReady() calls with a single log message in the plugin code itself:

// log an error if initialization didn't succeed
this.esContext.waitTillReady().then((success) => {
  if (!success) {
    this.systemLogger.error(`initialization failed, events will not be indexed`);
  }
});

I'm not convinced this is the best approach; perhaps waitTillReady() should do the logging itself, in its error paths, but then that's really too specific - I'm kinda treating waitTillReady() as a boolean getter at this point, and it felt safer to not have it log specific outcomes itself as a side effect.

But it clearly needs a comment in both locations; it's certainly not completely obvious.
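
To make the "completely synchronous" client path above concrete, a sketch (names are illustrative, not the exact plugin code):

import { Subject } from 'rxjs';

interface Doc {
  index: string;
  body: Record<string, unknown>;
}

const docBuffer$ = new Subject<Doc>();

// logging an event is now just a synchronous push onto the subject; the
// buffering, the waitTillReady() check, and the bulk indexing all happen
// downstream in the buffered pipeline
function logEvent(doc: Doc): void {
  docBuffer$.next(doc);
}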

Comment on lines 80 to 81
await this.doneWriting.wait();
this.docsBufferedSubscription.unsubscribe();
Contributor

Line 81 is actually redundant - once docBuffer$ completes, it will reactively unsubscribe docsBufferedSubscription, so there's no need to call this explicitly.
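
A tiny standalone example of that behavior - RxJS closes the subscription on its own once the source completes:

import { Subject } from 'rxjs';

const source$ = new Subject<number>();
const subscription = source$.subscribe((n) => console.log('got', n));

source$.next(1);
console.log(subscription.closed); // false - still subscribed

source$.complete();
console.log(subscription.closed); // true - closed reactively, no explicit unsubscribe() needed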

Comment on lines 59 to 63
const docsBuffered$ = this.docBuffer$.pipe(
  bufferTime(EVENT_BUFFER_TIME, null, EVENT_BUFFER_LENGTH),
  filter((docs) => docs.length > 0),
  switchMap(async (docs) => await this.indexDocuments(docs))
);
Contributor

You can actually use this instead of the ReadySignal.

All Observables expose a toPromise method, which returns a promise that resolves when they complete.
As docsBuffered$ completes when docBuffer$ completes, you can do something like:

  this.docBufferedFlushed = docsBuffered$.toPromise();

and then in shutdown() you can do:

  await this.docBufferedFlushed;
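
A fuller sketch of that suggestion wired together (illustrative only; the class and method names are stand-ins, not the actual cluster_client_adapter code):

import { Subject } from 'rxjs';
import { bufferTime, filter, switchMap } from 'rxjs/operators';

class AdapterSketch {
  private docBuffer$ = new Subject<object>();
  private docBufferedFlushed: Promise<unknown>;

  constructor() {
    const docsBuffered$ = this.docBuffer$.pipe(
      bufferTime(1000, null, 100),
      filter((docs) => docs.length > 0),
      switchMap(async (docs) => await this.indexDocuments(docs))
    );
    // resolves once docsBuffered$ completes, i.e. after the final buffer is flushed
    this.docBufferedFlushed = docsBuffered$.toPromise();
  }

  public indexDocument(doc: object): void {
    this.docBuffer$.next(doc);
  }

  public async shutdown(): Promise<void> {
    this.docBuffer$.complete(); // stop accepting docs, flush the rest
    await this.docBufferedFlushed; // wait for the final bulk write
  }

  private async indexDocuments(docs: object[]): Promise<void> {
    // bulk index the batch (omitted)
  }
}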

@pmuellr
Member Author

pmuellr commented Nov 17, 2020

I don't think there's a way to actually test this locally other than checking "by eye" that it uses bulk.. so LGTM 🤷

There are some tests that verify the bulk writes are working, for both time and count, at the cluster_client_adapter level anyway - that seemed good enough to me:

describe('buffering documents', () => {
  test('should write buffered docs after timeout', async () => {
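
For anyone curious, a minimal standalone sketch of that kind of test (not the actual Kibana test; it assumes a Jest environment and uses a tiny buffer time plus a plain delay instead of the plugin's constants):

import { Subject } from 'rxjs';
import { bufferTime, filter, switchMap } from 'rxjs/operators';

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

test('should write buffered docs after timeout', async () => {
  const written: string[][] = [];
  const docBuffer$ = new Subject<string>();

  docBuffer$
    .pipe(
      bufferTime(50, null, 100), // tiny buffer time so the test runs fast
      filter((docs) => docs.length > 0),
      switchMap(async (docs) => {
        written.push(docs); // stand-in for the bulk index call
      })
    )
    .subscribe();

  docBuffer$.next('doc-1');
  docBuffer$.next('doc-2');

  // nothing is written until the buffer time elapses ...
  expect(written).toEqual([]);

  await delay(100);

  // ... then both docs arrive in a single "bulk" write
  expect(written).toEqual([['doc-1', 'doc-2']]);
});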

What turned out to be impossible to test is the new plugin shutdown code that flushes the remaining events. We do test that at the cluster_client_adapter level, but the plugin stop test is a bit lame - it's the best I could figure out ATM, though. I think I searched and no one else was really testing plugin stop, but that likely makes sense, because who else is doing significant work in plugin stop? That reminds me, I guess I should have someone in platform take a look at this bit ... :-)

I have been testing the stop flushing "by eye". Kill Kibana, and you'll now see the plugin's own "stopping" event gets logged, which never happened before! (check by curling the event log directly via es)

Contributor

@ymao1 ymao1 left a comment

LGTM! Verified that it works as expected

Comment on lines 158 to 160
this.systemLogger.info('shutdown: waiting to finish');
await this.esContext?.shutdown();
this.systemLogger.info('shutdown: finished');
Contributor

@pgayvallet pgayvallet Nov 17, 2020

What is esContext?.shutdown doing exactly?

Just want to point out that plugin.stop will only be called in case of graceful shutdowns. Process termination will obviously not call this, or the process may even be killed during this invocation. So this cleanup needs to be resilient to such scenarios.

Member Author

esContext.shutdown() will "stop" the RxJS pipeline batching event log documents, and wait until the final batch is written out (bulk indexed into the event log index). Assuming the general RxJS "complete" processing is very quick, the only latency will be from the bulk index call, which would be at most 100 docs (max ~1K in size).
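
Roughly, as a sketch (approximate names, not the exact adapter code), the shutdown flow is: complete the subject so no more docs are accepted, then wait on a signal that's resolved once the pipeline has finished its final bulk write:

import { Subject } from 'rxjs';
import { bufferTime, filter, switchMap } from 'rxjs/operators';

// minimal externally-resolvable promise, roughly what the PR's doneWriting signal provides
function createSignal() {
  let resolveSignal!: () => void;
  const promise = new Promise<void>((resolve) => (resolveSignal = resolve));
  return { wait: () => promise, signal: () => resolveSignal() };
}

const doneWriting = createSignal();
const docBuffer$ = new Subject<object>();

docBuffer$
  .pipe(
    bufferTime(1000, null, 100),
    filter((docs) => docs.length > 0),
    switchMap(async (docs) => {
      // bulk index the batch (omitted)
    })
  )
  .subscribe({
    // fires after the final buffered batch has been handled
    complete: () => doneWriting.signal(),
  });

async function shutdown(): Promise<void> {
  docBuffer$.complete(); // stop accepting docs; the remaining buffer gets flushed
  await doneWriting.wait(); // wait until that final bulk write has gone out
}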

Totally understand that it only gets called under "nice" circumstances. We don't really have a better story though, and given the current buffering (100 docs or 1 sec elapsed), it shouldn't lose too much even in an OOM kind of non-clean shutdown.

The event log is historic data, and we've always treated it as not critical data - it's not a source of truth.

And I think the story with this is better than what it's replacing, which was doing setImmediate()s to force the indexing of individual documents off the main loop, which could have gotten really chaotic (but we never saw that).

We're also now getting the "event log stopped" message (which we've been logging since the very beginning) actually indexed; before this it was never indexed, as presumably the Kibana process never waited for unfinished setImmediate() processing to finish (and rightly so!).

Still, it does scare me a bit!

Contributor

I agree that awaiting this in the case of a "normal" shutdown is way better than a setImmediate implementation (which was likely never finishing, as the delay between plugin shutdown and termination is rather short).

Contributor

Created #83612 FYI

@mikecote
Contributor

@elasticmachine merge upstream

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Distributable file count

id        before   after   diff
default   42887    42886   -1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr merged commit 5bfe665 into elastic:master Nov 20, 2020
pmuellr added a commit to pmuellr/kibana that referenced this pull request Nov 20, 2020
…#80941)

resolves elastic#55634
resolves elastic#65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.

Also now flushing those buffers at plugin stop() time, which
we couldn't do before with the single index calls, which were
run via `setImmediate()`.
spalger added a commit that referenced this pull request Nov 20, 2020
@spalger
Contributor

spalger commented Nov 20, 2020

Sorry @pmuellr, but once this was merged it started to cause type and Jest failures on master, and the same failures popped up in your backport, so I reverted the PR and ask that you resubmit it. I think it would make sense to close the backport and backport the second PR too.

Failures: https://kibana-ci.elastic.co/job/elastic+kibana+pipeline-pull-request/89145/execution/node/385/log/
Master build: https://kibana-ci.elastic.co/job/elastic+kibana+master/9788/

pmuellr added a commit to pmuellr/kibana that referenced this pull request Nov 20, 2020
resolves elastic#55634
resolves elastic#65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.

Also now flushing those buffers at plugin stop() time, which
we couldn't do before with the single index calls, which were
run via `setImmediate()`.

This is a redo of PR elastic#80941 which
had to be reverted.
@pmuellr
Member Author

pmuellr commented Nov 20, 2020

replacement PR here: #83927

It had to be reverted because I didn't merge upstream after PR #81891 got merged, which changed the spaces plugin setup/start, and the reverted PR contained a new test module for the plugin. I had to change where the spaces plugin is passed around in setup/start.

@pmuellr pmuellr added the backport:skip This PR does not require backporting label Nov 20, 2020
pmuellr added a commit that referenced this pull request Nov 20, 2020
…83927)

resolves #55634
resolves #65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.

Also now flushing those buffers at plugin stop() time, which
we couldn't do before with the single index calls, which were
run via `setImmediate()`.

This is a redo of PR #80941 which
had to be reverted.
pmuellr added a commit to pmuellr/kibana that referenced this pull request Nov 20, 2020
…lastic#83927)

resolves elastic#55634
resolves elastic#65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.

Also now flushing those buffers at plugin stop() time, which
we couldn't do before with the single index calls, which were
run via `setImmediate()`.

This is a redo of PR elastic#80941 which
had to be reverted.
pmuellr added a commit that referenced this pull request Nov 20, 2020
…83927) (#83962)

resolves #55634
resolves #65746

Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.

Also now flushing those buffers at plugin stop() time, which
we couldn't do before with the single index calls, which were
run via `setImmediate()`.

This is a redo of PR #80941 which
had to be reverted.

Labels

backport:skip This PR does not require backporting Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes reverted Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.11.0 v8.0.0

Development

Successfully merging this pull request may close these issues.

Investigate event log write performance / stress testing
[alerting event log] buffer events being written instead of writing when logged

8 participants