Avoid expensive operations when constructing an Engine #43699

Closed
@DaveCTurner


Today (5f6321a) when constructing an InternalEngine we perform some potentially expensive operations, including:

	at org.elasticsearch.index.IndexWarmer.warm(IndexWarmer.java:81)
	at org.elasticsearch.index.IndexService.lambda$createShard$4(IndexService.java:402)
	at org.elasticsearch.index.engine.InternalEngine$SearchFactory.newSearcher(InternalEngine.java:2244)
	at org.apache.lucene.search.SearcherManager.getSearcher(SearcherManager.java:198)
	at org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.<init>(InternalEngine.java:326)
	at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:598)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:238)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:184)

and

	at org.elasticsearch.index.engine.InternalEngine.restoreVersionMapAndCheckpointTracker(InternalEngine.java:2820)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:257)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:184)

We create the Engine under IndexShard#mutex:

synchronized (mutex) {
    verifyNotClosed();
    assert currentEngineReference.get() == null : "engine is running";
    // we must create a new engine under mutex (see IndexShard#snapshotStoreMetadata).
    final Engine newEngine = engineFactory.newReadWriteEngine(config);
    onNewEngine(newEngine);
    currentEngineReference.set(newEngine);
    // We set active because we are now writing operations to the engine; this way,
    // if we go idle after some time and become inactive, we still give sync'd flush a chance to run.
    active.set(true);
}

This can block cluster state updates, because IndexShard#updateShardState requires the same mutex:

public void updateShardState(final ShardRouting newRouting,
                             final long newPrimaryTerm,
                             final BiConsumer<IndexShard, ActionListener<ResyncTask>> primaryReplicaSyncer,
                             final long applyingClusterStateVersion,
                             final Set<String> inSyncAllocationIds,
                             final IndexShardRoutingTable routingTable) throws IOException {
    final ShardRouting currentRouting;
    synchronized (mutex) {

We should survey the things we do during the startup of all of the engines and make sure that none of them will block for too long.

Relates to https://discuss.elastic.co/t/187604, in which the engine takes multiple minutes to start up because it is loading global ordinals. This blocks a cluster state update for long enough that the node is removed from the cluster.
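
To make the contention concrete, here is a minimal, self-contained sketch of the pattern described above. It is not Elasticsearch code: the class `MutexContentionSketch`, its methods, and the use of `Thread.sleep` as a stand-in for engine construction are all hypothetical, chosen only to mirror the shape of `IndexShard#mutex` being held during slow work while another thread needs the same monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class MutexContentionSketch {
    private final Object mutex = new Object();
    private final CountDownLatch engineCreationStarted = new CountDownLatch(1);

    // Stand-in for constructing the engine while holding the shard mutex.
    void createEngine(long constructionMillis) throws InterruptedException {
        synchronized (mutex) {
            engineCreationStarted.countDown();
            // Simulates slow work in the constructor (e.g. warming, restoring state).
            Thread.sleep(constructionMillis);
        }
    }

    // Stand-in for a cluster state update that needs the same mutex.
    // Returns how long the caller waited to acquire it, in milliseconds.
    long updateShardState() {
        final long start = System.nanoTime();
        synchronized (mutex) {
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }
    }

    // Starts the slow construction on one thread, then measures how long
    // the "cluster state update" on the calling thread is blocked.
    static long measureBlockedMillis(long constructionMillis) {
        final MutexContentionSketch shard = new MutexContentionSketch();
        final Thread recovery = new Thread(() -> {
            try {
                shard.createEngine(constructionMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        recovery.start();
        try {
            // Wait until the other thread actually holds the mutex.
            shard.engineCreationStarted.await();
            final long waited = shard.updateShardState();
            recovery.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("update waited ~" + measureBlockedMillis(500) + " ms");
    }
}
```

In this sketch the wait in `updateShardState` grows with however long the simulated construction takes, which is the same shape as the multi-minute stall reported in the linked discussion.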

Labels

:Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.)
>bug
