[BUG] In batch mode _cluster/allocation/explain API returns incorrect response #13990

@SwethaGuptha

Describe the bug

The _cluster/allocation/explain API returns an incorrect response on clusters with batch mode enabled, because shard allocation explain requests are served by GatewayAllocator instead of ShardsBatchGatewayAllocator (AllocatorFetchLogic, ExistingShardAllocatorSetting). A change in AllocationService is required so that it switches to ShardsBatchGatewayAllocator when batch mode is enabled.

The issue was identified as follows:
With index.unassigned.node_left.delayed_timeout enabled, the nodes holding 2 replicas of a shard were taken down. The expected response from _cluster/allocation/explain was allocation_delayed, but the API returned awaiting_info instead.

Related component

Cluster Manager

To Reproduce

  1. Create a cluster with a dedicated master and 10 data nodes.
  2. Create a test index with 2 primary shards and 3 replicas:
curl -X PUT "localhost:9200/test-ind?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 3
    }
  }
}'
  3. Enable the unassigned delayed_timeout setting:
curl -X PUT "localhost:9200/_all/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}'
  4. Get the nodes holding shards of the index:
curl localhost:9200/_cat/shards/test-ind
  5. Stop the OpenSearch process on 2 data nodes that hold replicas of shard 0.
  6. Get the allocation explanation for the shard:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '{
  "index": "test-ind",
  "shard": 0,
  "primary": false
}'
  7. Validate that the can_allocate field in the response is awaiting_info. The response looks like this:
{"index":"test-ind","shard":0,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"NODE_LEFT","at":"2024-06-05T05:33:16.753Z","details":"node_left [Bvu-mf5XSPu3DEmv9ndBgw]","last_allocation_status":"no_attempt"},"can_allocate":"awaiting_info","allocate_explanation":"cannot allocate because information about existing shard data is still being retrieved from some of the nodes","node_allocation_decisions":[{"node_id":"3YYYQYZLQaGck1tIOJ57xg","node_name":"517c7e06d65968c38f1a4140b265ccc4","
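
To pick the two data nodes to stop, the _cat/shards output from the reproduction steps can be filtered for replica copies of shard 0. The following is a minimal sketch, not part of the original report: the helper name replica_nodes is hypothetical, and it assumes the default _cat/shards column order (index, shard, prirep, state, docs, store, ip, node) and node names without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): read `_cat/shards`
# output on stdin and print the names of nodes holding STARTED replica
# copies of the given index and shard.
# Assumes the default column order: index shard prirep state docs store ip node.
replica_nodes() {
  awk -v idx="$1" -v shard="$2" \
    '$1 == idx && $2 == shard && $3 == "r" && $4 == "STARTED" { print $NF }'
}

# Example usage against the cluster from the reproduction steps:
#   curl -s localhost:9200/_cat/shards/test-ind | replica_nodes test-ind 0
```

Unassigned rows are skipped because they carry no node column; any two of the printed nodes are candidates for the stop step.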

Expected behavior

The value of the can_allocate field in the response is allocation_delayed.
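
The comparison between the observed and expected decision can be scripted. This is a rough sketch under stated assumptions: the helper names extract_can_allocate and check_can_allocate are hypothetical, the expected value allocation_delayed is the one named in the bug description, and the sed pattern assumes the can_allocate key appears only once in the response.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): print the value of the
# "can_allocate" field from an allocation-explain response read on stdin.
# Assumes the key appears once in the response.
extract_can_allocate() {
  sed -n 's/.*"can_allocate"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if the response on stdin reports the
# expected decision; otherwise print the mismatch and return non-zero.
check_can_allocate() {
  expected="$1"
  actual="$(extract_can_allocate)"
  if [ "$actual" = "$expected" ]; then
    echo "OK: can_allocate=$actual"
  else
    echo "MISMATCH: expected $expected, got ${actual:-<missing>}"
    return 1
  fi
}

# Example usage with the explain request from the reproduction steps:
#   curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain' \
#     -H 'Content-Type: application/json' \
#     -d '{"index": "test-ind", "shard": 0, "primary": false}' \
#     | check_can_allocate allocation_delayed
```

On a cluster affected by this bug the check would report a mismatch with awaiting_info as the observed value.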

Additional Details

OpenSearch Version: 2.14
