Retries in blocksStoreQuerier.queryWithConsistencyCheck() don't query all zones #5468

**Describe the bug**
With zone-aware replication enabled and a replication factor (RF) greater than the number of zones, all three retries for the same block can go to replicas in the same zone.

This can be problematic during:
- An AZ outage: even if we have replicas in healthy AZs, we might never query them.
- A zone-based deployment: multiple store-gateways in the same zone can be brought down at once, and all 3 retries might hit store-gateways that are down.

Assume we have 9 replicas for a block:
- sg1 (AZ1)
- sg2 (AZ1)
- sg3 (AZ1)
- sg4 (AZ2)
- sg5 (AZ2)
- sg6 (AZ2)
- sg7 (AZ3)
- sg8 (AZ3)
- sg9 (AZ3)

Assume AZ1 is down and sg1, sg2 and sg3 are not available.
The retry logic picks a random store-gateway from the list, so all three retries can go to store-gateways in AZ1. The query then fails even though six healthy replicas exist in AZ2 and AZ3.

Relevant code:
- Retry Logic: https://github.com/cortexproject/cortex/blob/master/pkg/querier/blocks_store_queryable.go#L505
- Picking random store-gateways: https://github.com/cortexproject/cortex/blob/master/pkg/querier/blocks_store_replicated_set.go#L147

**To Reproduce**
Steps to reproduce the behavior:
1. Enable zone aware replication
2. Set RF to 9
3. Bring down multiple store-gateways in the same AZ.

**Expected behavior**
- An AZ outage shouldn't fail a query if there are replicas in other AZs.

