Retries in blocksStoreQuerier.queryWithConsistencyCheck() don't query all zones #5468

Closed
@harry671003

Describe the bug
With zone-aware replication and a replication factor (RF) greater than the number of zones, all three retries for the same block can go to replicas in the same zone.

This can be problematic during:

  • An AZ outage - even if we have replicas in healthy AZs, we might never query them.
  • A zone-based deployment - multiple store-gateways in the same zone can be brought down at once, and all 3 retries might hit store-gateways that are down.

Assume we have 9 replicas for a block:

  • sg1 (AZ1)
  • sg2 (AZ1)
  • sg3 (AZ1)
  • sg4 (AZ2)
  • sg5 (AZ2)
  • sg6 (AZ2)
  • sg7 (AZ3)
  • sg8 (AZ3)
  • sg9 (AZ3)

Assume AZ1 is down, so sg1, sg2 and sg3 are unavailable.
The retry logic picks a random store-gateway from the list, so all three retries can go to store-gateways in AZ1 even though the six replicas in AZ2 and AZ3 are healthy.
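
To get a rough sense of how likely this is, here is a small sketch. It assumes each retry picks uniformly at random among the replicas not yet attempted for the block; the function name `allRetriesInDownZone` is made up for this illustration and is not part of Cortex.

```go
package main

import "fmt"

// allRetriesInDownZone returns the probability that `retries` consecutive
// picks, drawn uniformly at random without replacement from `total`
// replicas, all land on one of the `down` unavailable replicas.
// Assumption: already-attempted replicas are excluded from later picks.
func allRetriesInDownZone(total, down, retries int) float64 {
	p := 1.0
	for i := 0; i < retries; i++ {
		if down-i <= 0 {
			// Fewer unavailable replicas than retries: at least one
			// retry must reach a healthy replica.
			return 0
		}
		p *= float64(down-i) / float64(total-i)
	}
	return p
}

func main() {
	// 9 replicas, 3 of them in the failed AZ, 3 attempts.
	fmt.Printf("%.4f\n", allRetriesInDownZone(9, 3, 3))
}
```

Under that assumption the chance is (3/9)·(2/8)·(1/7) = 1/84, or about 1.2% per block. That is small for a single block, but a query touching many blocks makes a failure much more likely.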

Relevant code:

To Reproduce
Steps to reproduce the behavior:

  1. Enable zone aware replication
  2. Set RF to 9
  3. Bring down multiple store-gateways in the same AZ.

Expected behavior

  • An AZ outage shouldn't fail a query if there are replicas in other AZs.
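
One possible way to get this behavior is to prefer zones that have not been tried yet when choosing the next replica. The sketch below is only an illustration of that idea, not the actual Cortex code: the `replica` type and the `pickReplica` function are hypothetical names, and the real selection logic in `blocksStoreQuerier` may be structured differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// replica is a hypothetical stand-in for a store-gateway instance.
type replica struct {
	Addr string
	Zone string
}

// pickReplica chooses the next store-gateway to try for a block.
// It never returns an already-attempted address, and it prefers replicas
// in zones that have not been attempted yet. Only when every remaining
// replica is in an attempted zone does it fall back to those.
func pickReplica(replicas []replica, triedAddrs, triedZones map[string]bool) (replica, bool) {
	var fresh, fallback []replica
	for _, r := range replicas {
		if triedAddrs[r.Addr] {
			continue
		}
		if triedZones[r.Zone] {
			fallback = append(fallback, r)
		} else {
			fresh = append(fresh, r)
		}
	}
	candidates := fresh
	if len(candidates) == 0 {
		candidates = fallback
	}
	if len(candidates) == 0 {
		return replica{}, false
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	replicas := []replica{
		{"sg1", "AZ1"}, {"sg2", "AZ1"}, {"sg3", "AZ1"},
		{"sg4", "AZ2"}, {"sg5", "AZ2"}, {"sg6", "AZ2"},
		{"sg7", "AZ3"}, {"sg8", "AZ3"}, {"sg9", "AZ3"},
	}
	triedAddrs := map[string]bool{}
	triedZones := map[string]bool{}
	for attempt := 1; attempt <= 3; attempt++ {
		r, ok := pickReplica(replicas, triedAddrs, triedZones)
		if !ok {
			break
		}
		triedAddrs[r.Addr] = true
		triedZones[r.Zone] = true
		fmt.Printf("attempt %d: %s (%s)\n", attempt, r.Addr, r.Zone)
	}
}
```

With the 9 replicas from the example, the three attempts always land in three different AZs, so a single-AZ outage can fail at most one of them.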
