Kibana should be able to start while an Elasticsearch snapshot restore is still in progress #116255
Pinging @elastic/kibana-core (Team:Core)
This affects the following plugins: @elastic/rac. What would the effort be to ensure "alerts as data" and dependent plugins can successfully start up even if some of the alerts-as-data indices are still being restored from snapshot?
Pinging @elastic/apm-ui (Team:apm)
While I'm not on the RAC team, I've reviewed some of the code, so maybe the following is helpful to the code owners: the index update and write operations happen lazily when the first alert in a namespace is about to be written (more specifically, the first time a "writer" is created). If that fails, the rule data client is put into a "read-only mode". A retry would therefore mean disabling the "read-only mode" and evicting the failed writer from the cache so it is re-created on the next alert execution. The effort depends on the guarantees we expect from the writer: if we are fine with dropping alert documents until the next retry, it would be relatively straightforward to implement. If we want to enqueue the failed write operations until the indices are available, we need to modify the "write" code path more extensively.
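The retry described above could be sketched roughly as follows. Note this is an illustrative sketch, not the actual rule data client API: the names `RuleDataClientSketch`, `getWriter`, and `retryInitialization` are hypothetical, and the real code paths are more involved.

```typescript
// A writer lazily created per namespace; write failures before creation
// put the client into read-only mode (alerts are dropped until a retry).
type Writer = { write: (doc: object) => Promise<void> };

class RuleDataClientSketch {
  private readOnly = false;
  private writerCache = new Map<string, Writer>();

  constructor(private createWriter: (namespace: string) => Promise<Writer>) {}

  async getWriter(namespace: string): Promise<Writer | undefined> {
    if (this.readOnly) return undefined; // drop writes while read-only
    const cached = this.writerCache.get(namespace);
    if (cached) return cached;
    try {
      const writer = await this.createWriter(namespace);
      this.writerCache.set(namespace, writer);
      return writer;
    } catch {
      // Index bootstrap failed (e.g. the index is still being restored
      // from snapshot): fall back to read-only mode.
      this.readOnly = true;
      return undefined;
    }
  }

  // The proposed retry: leave read-only mode and evict the failed writer
  // so it is re-created on the next alert execution.
  retryInitialization(namespace: string): void {
    this.readOnly = false;
    this.writerCache.delete(namespace);
  }
}
```

This implements the "drop alert documents until the next retry" variant; enqueueing failed writes would require buffering inside `getWriter`/`write` instead of returning `undefined`.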
@rudolf coming back to this after a long hiatus,
Would these not be "." indices? Would they have any recognizable characteristics at all? (always
@AlexP-Elastic yes, as far as I'm aware all of these would be "." indices. Matching on all
Having said that, there are no technical controls to enforce this, so there's nothing stopping a plugin from becoming dependent on a non-dot index being available. It would be best if we could automatically test this behaviour, but I can't see an easy way to achieve good coverage, so I think we will have to rely on communicating best practices.
Waiting for a large snapshot restore to complete can take a long time, and asking users to delay starting Kibana until the restore is complete could mean long downtime. Instead, Kibana should be able to start up once a minimum set of indices has been restored.
While a snapshot restore is in progress, all indices being restored will have a "red" status and all read/write/search operations against them will fail. Plugins that perform startup logic against an index not listed below will therefore have to wait for the index to become "yellow" before operating on it, or alternatively retry any failures that occur.
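The "wait for yellow" strategy could look roughly like the sketch below. The `checkHealth` callback stands in for a call such as the Elasticsearch cluster health API with `wait_for_status: "yellow"` scoped to the index; the function name and polling parameters are illustrative, not Kibana APIs.

```typescript
type Health = "red" | "yellow" | "green";

// Poll an index's health until it is at least "yellow", giving up after
// a fixed number of attempts. A plugin would run this in its start
// lifecycle before its first read/write against the index.
async function waitForIndexYellow(
  checkHealth: () => Promise<Health>,
  opts: { retries: number; delayMs: number }
): Promise<void> {
  for (let attempt = 0; attempt < opts.retries; attempt++) {
    const status = await checkHealth();
    if (status === "yellow" || status === "green") return;
    await new Promise((resolve) => setTimeout(resolve, opts.delayMs));
  }
  throw new Error("index did not reach yellow status in time");
}
```

"Yellow" (rather than "green") is the right threshold because it means the primary shard is assigned and reads/writes will succeed, even if replicas are still allocating.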
List of indices required for Kibana to start:
- `.tasks` - saved object migrations start several operations with `wait_for_completion: false`, which indirectly requires this index
- `.security` - for Kibana to authenticate against Elasticsearch
- `.kibana`, `.kibana_task_manager` - saved object indices
- `.kibana_security_session_*` - for storing user sessions

Indices that need to be evaluated:
- `.apm-agent-configuration` - failures to create the APM agent configuration index will be retried for 17 minutes, but then fail with just an error message in the logs. This means that if it takes 20 minutes to restore the APM agent configuration index, the index might have incorrect mappings applied. https://github.com/elastic/kibana/blob/master/x-pack/plugins/observability/server/utils/create_or_update_index.ts#L63
- `.apm-custom-link`
- `.fleet*`
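One way to address the fixed ~17 minute retry window mentioned for the APM agent configuration index is an unbounded retry with capped exponential backoff, so index setup eventually succeeds however long the restore takes. This is a hypothetical sketch, not the existing `createOrUpdateIndex` code; all names and parameters are illustrative.

```typescript
// Retry an operation indefinitely, doubling the delay between attempts
// up to a cap, so a restore that takes hours still converges eventually.
async function retryUntilSuccess<T>(
  operation: () => Promise<T>,
  opts: {
    initialDelayMs: number;
    maxDelayMs: number;
    onError?: (error: unknown) => void;
  }
): Promise<T> {
  let delay = opts.initialDelayMs;
  for (;;) {
    try {
      return await operation();
    } catch (error) {
      opts.onError?.(error); // e.g. log "index still red, retrying"
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, opts.maxDelayMs);
    }
  }
}
```

Because the retry never gives up, the mappings are guaranteed to be applied once the index becomes available, avoiding the "incorrect mappings" failure mode described above, at the cost of holding a pending task until then.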