
Kibana should be able to start while an Elasticsearch snapshot restore is still in progress #116255

Open
rudolf opened this issue Oct 26, 2021 · 6 comments
Labels
impact:needs-assessment (Product and/or Engineering needs to evaluate the impact of the change) · loe:small (Small Level of Effort) · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) · v8.0.0


rudolf commented Oct 26, 2021

Waiting for a large snapshot restore to complete can take a long time. Asking users to delay starting Kibana until the snapshot restore is complete could mean long downtime. Instead, Kibana should be able to start up once a minimal set of indices has been restored.

When a snapshot restore is in progress, all indices being restored will have a "red" status and all read/write/search operations against them will fail. So plugins that perform startup logic against an index not listed below will have to wait for that index to become "yellow" before performing operations against it, or alternatively retry any failures that occur.
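As an illustration, a minimal sketch of the "wait for yellow" approach (assuming an @elastic/elasticsearch v8-style client where responses are plain bodies; the helper name is illustrative, not existing Kibana code):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper: poll cluster health until the given index reaches at
// least "yellow", i.e. its primary shards have been restored and operations
// against it can succeed.
async function waitForIndexYellow(client: Client, index: string): Promise<void> {
  for (;;) {
    try {
      const health = await client.cluster.health({
        index,
        wait_for_status: 'yellow',
        timeout: '30s',
      });
      if (!health.timed_out) {
        return; // primaries are allocated; the index is usable
      }
    } catch (e) {
      // The wait timed out or the index has not been restored yet;
      // fall through and retry.
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}
```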

List of indices required for Kibana to start:

  • .tasks - saved object migrations start several operations with wait_for_completion: false, which indirectly requires this index
  • .security - for Kibana to authenticate against Elasticsearch
  • .kibana, .kibana_task_manager - saved object indices
  • .kibana_security_session_* - for storing user sessions

Indices that need to be evaluated:

  1. .apm-agent-configuration - failures to create the APM agent configuration index are retried for 17 minutes, but after that the failure is only logged as an error message. This means that if it takes 20 minutes to restore the APM agent configuration index, the index might end up with incorrect mappings (see the retry sketch after this list). https://github.com/elastic/kibana/blob/master/x-pack/plugins/observability/server/utils/create_or_update_index.ts#L63
  2. .apm-custom-link
  3. .fleet*
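A hedged sketch of the "retry until restored" alternative mentioned in item 1 (function name and timings are assumptions, not the actual create_or_update_index implementation): instead of giving up after a fixed window, the setup keeps retrying until the index can be created or its mappings updated.

```ts
import { Client, estypes } from '@elastic/elasticsearch';

// Illustrative sketch only: keep retrying index setup until it succeeds, so a
// snapshot restore that outlasts a fixed retry window cannot leave the index
// with stale mappings.
async function createOrUpdateIndexWithRetry(
  client: Client,
  index: string,
  mappings: estypes.MappingTypeMapping
): Promise<void> {
  for (;;) {
    try {
      if (await client.indices.exists({ index })) {
        // The index already exists (or has been restored); apply the expected
        // mappings on top of it.
        await client.indices.putMapping({ index, properties: mappings.properties });
      } else {
        await client.indices.create({ index, mappings });
      }
      return;
    } catch (e) {
      // Likely the index is still "red" or blocked while the restore is in
      // progress; back off and try again.
      await new Promise((resolve) => setTimeout(resolve, 30_000));
    }
  }
}
```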
rudolf added the Team:Core and v8.0.0 labels on Oct 26, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)


rudolf commented Oct 26, 2021

The following plugins depend on the ruleRegistry plugin, which exposes methods for initializing alerts as data indices. However, failures are never retried, e.g. https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.ts#L168

This affects the following plugins:
x-pack/plugins/apm
x-pack/plugins/infra
x-pack/plugins/observability
x-pack/plugins/security_solution
x-pack/plugins/uptime

@elastic/rac What would the effort be to ensure "alerts as data" and dependent plugins can successfully start up even if some of the alerts as data indices are still being restored from snapshot?

sorenlouv added the Team:APM label on Oct 26, 2021
@elasticmachine

Pinging @elastic/apm-ui (Team:apm)

@weltenwort

> @elastic/rac What would the effort be to ensure "alerts as data" and dependent plugins can successfully start up even if some of the alerts as data indices are still being restored from snapshot?

While I'm not on the RAC team, I've reviewed some of the code. Maybe the following would be helpful to the code owners:

The index update and write operations happen lazily when the first alert in a namespace is about to be written (more specifically, the first time a "writer" is created). If that fails, the rule data client is put into a "read-only mode". A retry would therefore mean disabling the "read-only mode" and evicting the failed writer from the cache so it is re-created on the next alert execution.

The effort depends on the guarantees we expect from the writer. If we are fine with dropping alert documents until the next retry, it would be relatively straightforward to implement. If we want to enqueue the failed write operations until the indices are available, we need to modify the "write" code path more extensively.
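To make the retry shape concrete, here is a hypothetical sketch (the names below are illustrative, not the actual rule_registry API): on a failed resource installation, the cached writer is evicted and the read-only flag cleared so the next rule execution re-attempts the bootstrap.

```ts
// Hypothetical names only – a sketch of the retry approach described above.
type AlertsWriter = { bulk: (docs: unknown[]) => Promise<void> };

class RuleDataClientSketch {
  private writeEnabled = true;
  private readonly writerCache = new Map<string, AlertsWriter>();

  constructor(
    private readonly installResources: (namespace: string) => Promise<AlertsWriter>
  ) {}

  async getWriter(namespace: string): Promise<AlertsWriter> {
    const cached = this.writerCache.get(namespace);
    if (cached && this.writeEnabled) {
      return cached;
    }
    try {
      // Index template / backing index bootstrap happens lazily here.
      const writer = await this.installResources(namespace);
      this.writerCache.set(namespace, writer);
      this.writeEnabled = true;
      return writer;
    } catch (error) {
      // e.g. the alerts-as-data index is still red during a snapshot restore.
      // Instead of latching into a permanent read-only mode, drop the cache
      // entry so the next rule execution retries the installation. Alert
      // documents written in the meantime are dropped (the "straightforward"
      // option above); queueing them would require deeper changes to the
      // write path.
      this.writerCache.delete(namespace);
      this.writeEnabled = false;
      throw error;
    }
  }
}
```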

sorenlouv removed the Team:APM label on Nov 3, 2021
exalate-issue-sync bot added the impact:needs-assessment and loe:small labels on Nov 4, 2021
@AlexP-Elastic

@rudolf coming back to this after a long hiatus,

> The following plugins depend on the ruleRegistry plugin, which exposes methods for initializing alerts as data indices.

Would these not be "." indices? Would they have any recognizable characteristics at all? (always _data_content, some other metadata?)


rudolf commented Aug 31, 2022

@AlexP-Elastic yes, as far as I'm aware all of these would be "." indices. Matching on all .* indices casts a rather wide net, but it might be acceptable as a first iteration and would certainly be smaller than restoring all indices in a snapshot.

Having said that, there are no technical controls to enforce this, so there's nothing stopping a plugin from becoming dependent on a non-dot index being available. It would be best if we could automatically test this behaviour, but I can't see an easy way to achieve good coverage, so I think we will have to rely on communicating best practices.
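For what it's worth, the "restore the dot indices first" idea could be sketched roughly as the two-phase restore below. Repository and snapshot names are placeholders, and on recent Elasticsearch versions system indices such as .security and .kibana* are restored via feature states rather than plain index patterns, so this only illustrates the ordering, not exact parameters.

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative two-phase restore: bring back the dot indices Kibana needs to
// start first, then restore the (typically much larger) data indices while
// Kibana is already up.
async function restoreKibanaIndicesFirst(client: Client) {
  // Phase 1: the small set of "." indices listed earlier in this issue.
  await client.snapshot.restore({
    repository: 'my_repository', // placeholder repository name
    snapshot: 'my_snapshot',     // placeholder snapshot name
    indices: '.kibana*,.kibana_task_manager*,.tasks,.security*,.kibana_security_session*',
    include_global_state: false,
    wait_for_completion: true,   // block until this small restore has finished
  });

  // Phase 2: everything else, in the background, while Kibana starts up.
  await client.snapshot.restore({
    repository: 'my_repository',
    snapshot: 'my_snapshot',
    indices: '*,-.*', // all non-dot indices; dot indices were handled above
    include_global_state: false,
    wait_for_completion: false,
  });
}
```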
