Open
Description
This meta issue tracks known issues with scaling clusters to large numbers of shards.
Security
General
- Push back on excessive requests for stats #51992
- Add pagination to diagnostic APIs #87555
- Make Cache More Memory Efficient #77546
- Cluster Stats API Slows down Considerably for Larger Clusters #79563
- Reduce merging in PersistedClusterStateService #79793
- Make org.elasticsearch.action.admin.cluster.state.ClusterStateResponse Compress the Cluster State #79906
- optimize getIndices in IndicesSegmentResponse #80064
- Massive async shard fetch requests consume lots of heap memories on master node. #80694
- Pending task batching can be a bottleneck #81626
- Batch add-block-index-to-close, add-index-block and finalize-index-block tasks #81627
- Stop unnecessary retries of shard-started tasks #81628
- Batch up master tasks to create, mount, and delete snapshots #81846
- Batch up failure-related ILM master tasks #81880
- Stats actions should discard intermediate state on cancellation #82337
- RestClusterGetSettingsAction Requests the Full Metadata from Master #82342
- More Compact Serialization of Metadata #82608
- Make TaskBatcher Less Lock-Heavy #82227
- Speed up MappingStats Computation on Coordinating Node #82830
- Add level=datastreams to Indices Stats #83049
- (Re)Starting a Data Node Holding a Large Number of Indices can Take Minutes #83203
- A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest #83204 -> Reduce resource needs of join validation #85380
- RecoverySourceHandler#runWithGenericThreadPool caused deadlock #85839
- Report stats related to new sizing guidance #86639
- Make GetIndexAction cancellable #87681
- Drop ClusterStateHealth#indices when unnecessary #90631
- Improve scalability of BroadcastReplicationActions #92902
- Computing IndicesQueryCache stats is O(N²) in shard count #97222
- TransportBroadcastByNodeAction does O(#shards) work on transport worker thread #97914
- Reduce usage of TransportMasterNodeReadAction #101805
Snapshots + SLM
- Less Verbose Serialization of Snapshot Failure in SLM Metadata #80942
- Batch Snapshot Finalizations #82824
- Snapshot Deletion Could Run more Concurrently to Snapshot Creation #82853
- Add Parameter to not Return Index Name Lists in GET Snapshots API #82937
- Avoid capturing SnapshotsInProgress$Entry in queue #88707
- Make SnapshotsInProgress diffable #88732
- Make snapshot deletes less memory intensive by reordering repository metadata updates #89163
- Snapshot creations have huge heap footprint after abrupt full-cluster restart #89952
- Reduce the number of objects allocated by SLM when listing the snapshots to retain #99953
Metrics
ILM + Allocation
- Simple ILM Task Batching Implementation #78547
- Optimize DataTierAllocationDecider Further #78235
- ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes #78246
- Speed up DataTierAllocationDecider #78075
- Implement DiffableStringMap's get and containsKey in terms of the wrapped innerMap #77965
- Improve LifecycleExecutionState parsing. #77855
- Reduce the number of times that LifecycleExecutionState is parsed when running a policy #77863
- Speed up toXContent Collection Serialization in some Spots #78742
- Store DataTier Preference directly on IndexMetadata #78668
- Store Disk Threshold Ignore Setting in IndexMetadata #78672
- Speed up Routing Nodes Priority Comparator #78609
- Allow indices lookup to be built lazily #78745
- Optimize XContent Object Parsers #78813
- Find a way to Deduplicate Index Settings #78892
- MasterService#patchVersions is rather inefficient #77888
- Replace RoutingTable#shardsWithState(...) with RoutingNodes#unassigned(...) #78931
- Speedup computing cluster health #78969
- IndexMetadataUpdater#applyChanges is rather inefficient #78980
- Batch Cluster State Updates in Datastream Rollover #79782
- Batch Index Settings Update Requests #79866
- Make MasterService.patchVersions not Rebuild the Full CS #79860
- Save some RoutingNodes Instantiations #79941
- Cache DiscoveryNode#trimTier Result #80179
- Rework ILM to not Require Inspecting all Indices on every Cluster State Update #80407
- Batch up failure-related ILM master tasks #81880
- Faster ShardsLimitAllocationDecider #82251
- Large ILM Task Batches are Executed too Slowly #82708
- Make AllocationService#adaptAutoExpandReplicas Faster #83092
- Speed up Building Indices Lookup in Metadata #83241
- Speed up Name Collision Check in Metadata.Builder #83340
- Make LIFECYCLE_NAME_SETTING a Field in IndexMetadata #83582
- Use static empty store files metadata #84034
- MetadataIndexAliasesService submits unbatched tasks at URGENT priority #89924
- Improve sharing and diffability of IndexRoutingTable #94933
- DataTiersUsageTransportAction is incredibly inefficient in large clusters #100230
Search
- Group shard request per node in the field capabilities API #74648
- Group shard request per node in the can match phase #78164
- Intern IndexFieldCapabilities Type String on Read #76405
- Fix NumberFieldMapper Referencing its Own Builder #77131
- Fix MatchOnlyTextFieldMapper Retaining a Reference to its Builder #77201
- Fix TextFieldMapper Retaining a Reference to its Builder #77251
- Search Responses to Many Shards use Excessive Amounts of Memory for OriginalIndices instances #78314
- Filter original indices in shard level request #78508
- Merge field caps responses on each node? #82879
- Updating index metadata does an expensive validation of the mapping even if unchanged #89309
- Searches against a large number of unavailable shards result in very large responses #90622
- Reduce copying when creating scroll/PIT ids #99219
- Limit shard failures accumulated by searches #99220
- Fork response-sending in OpenPointInTimeAction #99222
Network
- Sending Large Transport Messages Should be Optimized #82245
- APIs like /_cluster/state Break for Large Clusters due to Response Size Limitations #79560
- Render Mappings more Compact in GET /_cluster/state #83846
- Add the Ability to Disable certain REST APIs via a Cluster Setting #84876
- Track distribution of REST response sizes #84887
- Make use of chunked REST response infrastructure in more APIs #89838