
Kibana should be able to start while an Elasticsearch snapshot restore is still in progress #116255

Open
rudolf opened this issue Oct 26, 2021 · 6 comments
Labels
impact:needs-assessment (Product and/or Engineering needs to evaluate the impact of the change) · loe:small (Small Level of Effort) · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) · v8.0.0


rudolf commented Oct 26, 2021

Waiting for a large snapshot restore to complete can take a long time. Asking users to delay starting Kibana until the snapshot restore is complete could mean long downtime. Instead, Kibana should be able to start up once a minimal set of indices has been restored.

When a snapshot restore is in progress, all indices being restored will have a "red" status and all read/write/search operations against them will fail. So plugins that perform startup logic against an index not listed below will have to wait for that index to become "yellow" before performing operations against it, or alternatively retry any failures that occur.
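As an illustration, a minimal sketch of the "wait for yellow" approach (assuming an @elastic/elasticsearch v8-style client where responses are plain bodies; the helper name is illustrative, not existing Kibana code):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper: poll cluster health until the given index reaches at
// least "yellow", i.e. its primary shards have been restored and operations
// against it can succeed.
async function waitForIndexYellow(client: Client, index: string): Promise<void> {
  for (;;) {
    try {
      const health = await client.cluster.health({
        index,
        wait_for_status: 'yellow',
        timeout: '30s',
      });
      if (!health.timed_out) {
        return; // primaries are allocated; the index is usable
      }
    } catch (e) {
      // The wait timed out or the index has not been restored yet;
      // fall through and retry.
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}
```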

List of indices required for Kibana to start:

  • .tasks - saved object migrations start several operations with wait_for_completion: false, which indirectly requires this index
  • .security - for Kibana to authenticate against Elasticsearch
  • .kibana, .kibana_task_manager - saved object indices
  • .kibana_security_session_* - for storing user sessions

Indices that need to be evaluated:

  1. .apm-agent-configuration - failures to create the APM agent configuration index are retried for 17 minutes, but after that the failure is only logged as an error message. This means that if it takes 20 minutes to restore the APM agent configuration index, the index might end up with incorrect mappings (see the retry sketch after this list). https://github.com/elastic/kibana/blob/master/x-pack/plugins/observability/server/utils/create_or_update_index.ts#L63
  2. .apm-custom-link
  3. .fleet*
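A hedged sketch of the "retry until restored" alternative mentioned in item 1 (function name and timings are assumptions, not the actual create_or_update_index implementation): instead of giving up after a fixed window, the setup keeps retrying until the index can be created or its mappings updated.

```ts
import { Client, estypes } from '@elastic/elasticsearch';

// Illustrative sketch only: keep retrying index setup until it succeeds, so a
// snapshot restore that outlasts a fixed retry window cannot leave the index
// with stale mappings.
async function createOrUpdateIndexWithRetry(
  client: Client,
  index: string,
  mappings: estypes.MappingTypeMapping
): Promise<void> {
  for (;;) {
    try {
      if (await client.indices.exists({ index })) {
        // The index already exists (or has been restored); apply the expected
        // mappings on top of it.
        await client.indices.putMapping({ index, properties: mappings.properties });
      } else {
        await client.indices.create({ index, mappings });
      }
      return;
    } catch (e) {
      // Likely the index is still "red" or blocked while the restore is in
      // progress; back off and try again.
      await new Promise((resolve) => setTimeout(resolve, 30_000));
    }
  }
}
```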
rudolf added the Team:Core and v8.0.0 labels on Oct 26, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)


rudolf commented Oct 26, 2021

The following plugins depend on the ruleRegistry plugin, which exposes methods for initializing alerts as data indices. However, failures are never retried, e.g. https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.ts#L168

This affects the following plugins:
x-pack/plugins/apm
x-pack/plugins/infra
x-pack/plugins/observability
x-pack/plugins/security_solution
x-pack/plugins/uptime

@elastic/rac What would the effort be to ensure "alerts as data" and dependent plugins can successfully start up even if some of the alerts as data indices are still being restored from snapshot?

sorenlouv added the Team:APM label on Oct 26, 2021
@elasticmachine

Pinging @elastic/apm-ui (Team:apm)

@weltenwort

> @elastic/rac What would the effort be to ensure "alerts as data" and dependent plugins can successfully start up even if some of the alerts as data indices are still being restored from snapshot?

While I'm not on the RAC team, I've reviewed some of the code. Maybe the following would be helpful to the code owners:

The index update and write operations happen lazily when the first alert in a namespace is about to be written (more specifically, the first time a "writer" is created). If that fails, the rule data client is put into a "read-only mode". A retry would therefore mean disabling the "read-only mode" and evicting the failed writer from the cache so it is re-created on the next alert execution.

The effort depends on the guarantees we expect from the writer. If we are fine with dropping alert documents until the next retry, it would be relatively straightforward to implement. If we want to enqueue the failed write operations until the indices are available, we need to modify the "write" code path more extensively.
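To make the retry shape concrete, here is a hypothetical sketch (the names below are illustrative, not the actual rule_registry API): on a failed resource installation, the cached writer is evicted and the read-only flag cleared so the next rule execution re-attempts the bootstrap.

```ts
// Hypothetical names only – a sketch of the retry approach described above.
type AlertsWriter = { bulk: (docs: unknown[]) => Promise<void> };

class RuleDataClientSketch {
  private writeEnabled = true;
  private readonly writerCache = new Map<string, AlertsWriter>();

  constructor(
    private readonly installResources: (namespace: string) => Promise<AlertsWriter>
  ) {}

  async getWriter(namespace: string): Promise<AlertsWriter> {
    const cached = this.writerCache.get(namespace);
    if (cached && this.writeEnabled) {
      return cached;
    }
    try {
      // Index template / backing index bootstrap happens lazily here.
      const writer = await this.installResources(namespace);
      this.writerCache.set(namespace, writer);
      this.writeEnabled = true;
      return writer;
    } catch (error) {
      // e.g. the alerts-as-data index is still red during a snapshot restore.
      // Instead of latching into a permanent read-only mode, drop the cache
      // entry so the next rule execution retries the installation. Alert
      // documents written in the meantime are dropped (the "straightforward"
      // option above); queueing them would require deeper changes to the
      // write path.
      this.writerCache.delete(namespace);
      this.writeEnabled = false;
      throw error;
    }
  }
}
```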

sorenlouv removed the Team:APM label on Nov 3, 2021
exalate-issue-sync bot added the impact:needs-assessment and loe:small labels on Nov 4, 2021
@AlexP-Elastic

@rudolf coming back to this after a long hiatus,

> The following plugins depend on the ruleRegistry plugin, which exposes methods for initializing alerts as data indices.

Would these not be "." indices? Would they have any recognizable characteristics at all? (always _data_content, some other metadata?)


rudolf commented Aug 31, 2022

@AlexP-Elastic yes, as far as I'm aware all of these would be "." indices. Matching on all .* indices casts a rather wide net, but it might be acceptable as a first iteration and would certainly be smaller than restoring all indices in a snapshot.

Having said that, there are no technical controls to enforce this, so there's nothing stopping a plugin from becoming dependent on a non-dot index being available. It would be best if we could automatically test this behaviour, but I can't see an easy way to achieve good coverage, so I think we will have to rely on communicating best practices.
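For what it's worth, the "restore the dot indices first" idea could be sketched roughly as the two-phase restore below. Repository and snapshot names are placeholders, and on recent Elasticsearch versions system indices such as .security and .kibana* are restored via feature states rather than plain index patterns, so this only illustrates the ordering, not exact parameters.

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative two-phase restore: bring back the dot indices Kibana needs to
// start first, then restore the (typically much larger) data indices while
// Kibana is already up.
async function restoreKibanaIndicesFirst(client: Client) {
  // Phase 1: the small set of "." indices listed earlier in this issue.
  await client.snapshot.restore({
    repository: 'my_repository', // placeholder repository name
    snapshot: 'my_snapshot',     // placeholder snapshot name
    indices: '.kibana*,.kibana_task_manager*,.tasks,.security*,.kibana_security_session*',
    include_global_state: false,
    wait_for_completion: true,   // block until this small restore has finished
  });

  // Phase 2: everything else, in the background, while Kibana starts up.
  await client.snapshot.restore({
    repository: 'my_repository',
    snapshot: 'my_snapshot',
    indices: '*,-.*', // all non-dot indices; dot indices were handled above
    include_global_state: false,
    wait_for_completion: false,
  });
}
```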
