Receiver: Running with number of replicas lower than replication factor in a hashring is accepted #5639

matej-g · 2022-08-24T14:51:26Z

Thanos, Prometheus and Golang version used:
0.28.0-rc.0

What happened:
I was trying out the new RC locally with --receive.hashrings-algorithm=ketama, 6 replicas and a replication factor of 3. During my tests, some of my replicas never reached a ready state.

After more digging I found that this occurs when my setup gets into a state where the number of endpoints in the hashring is lower than the replication factor. I think the problem is twofold, depending on which hashing algorithm is used:

In my case I used --receive.hashrings-algorithm=ketama. This caused the hashring creation logic to get into an infinite loop, since we're not able to pre-calculate replicas for the sections. This causes the hashring change channel to block forever and the receiver to never obtain a hashring configuration, so although the receiver is running, it will never become ready, since storage is never initialized.
With the default hashmod algorithm, this issue might not be so obvious, since we're not doing such pre-calculation. However, it still means some replication requests land on the same nodes, which is not desired behavior.
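To illustrate the hashmod case, here is a minimal sketch under a simplified model in which replica n of a series is sent to `endpoints[(hash + n) % len(endpoints)]`. The function names, the FNV hash and the endpoint names are assumptions made for the illustration, not the actual Thanos code. With 2 endpoints and a replication factor of 3, at least two replicas necessarily share a node.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaNodes returns the endpoint picked for each replica of a series under
// a simplified hashmod model (hypothetical helper, not the Thanos API).
// It assumes endpoints is non-empty.
func replicaNodes(endpoints []string, series string, replicationFactor int) []string {
	h := fnv.New64a()
	h.Write([]byte(series))
	m := uint64(len(endpoints))
	base := h.Sum64() % m

	nodes := make([]string, 0, replicationFactor)
	for n := 0; n < replicationFactor; n++ {
		// Replica n is shifted by n positions, wrapping around the endpoint list.
		nodes = append(nodes, endpoints[(base+uint64(n))%m])
	}
	return nodes
}

// distinct counts how many different endpoints appear in nodes.
func distinct(nodes []string) int {
	seen := make(map[string]struct{}, len(nodes))
	for _, n := range nodes {
		seen[n] = struct{}{}
	}
	return len(seen)
}

func main() {
	endpoints := []string{"receive-0", "receive-1"} // fewer endpoints than replicas
	nodes := replicaNodes(endpoints, `up{job="demo"}`, 3)
	fmt.Printf("replicas land on %v (%d distinct nodes)\n", nodes, distinct(nodes))
}
```

Whatever the hash value is, three replicas spread over two endpoints can cover at most two distinct nodes, so the requested replication factor is silently not met.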

What you expected to happen:
I'd expect the receiver not to hang forever (in the case of the Ketama algorithm) and to handle a configuration where the replication factor cannot be guaranteed (e.g. log an error, exit the receiver).
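One possible shape for such handling is a check at hashring construction time that rejects the configuration up front. The sketch below is only an illustration: `validateHashring`, `errInsufficientNodes` and the error text are hypothetical names, not the receiver's real code.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsufficientNodes is a hypothetical sentinel error for this sketch.
var errInsufficientNodes = errors.New("hashring has fewer endpoints than the replication factor")

// validateHashring returns an error when the replication factor cannot be
// satisfied by the configured endpoints, so the caller can fail early instead
// of blocking while building the hashring.
func validateHashring(endpoints []string, replicationFactor uint64) error {
	if uint64(len(endpoints)) < replicationFactor {
		return fmt.Errorf("%w: got %d endpoints, replication factor is %d",
			errInsufficientNodes, len(endpoints), replicationFactor)
	}
	return nil
}

func main() {
	// Same shape as the reproduction: 2 endpoints, replication factor 3.
	if err := validateHashring([]string{"receive-0", "receive-1"}, 3); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```

Returning an error rather than looping leaves the decision (log and keep the old hashring, or exit) to the caller.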

How to reproduce it (as minimally and precisely as possible):

Create a hashring configuration with 2 endpoints
Create a receiver setup with --receive.hashrings-algorithm=ketama and --receive.replication-factor=3
Watch the receiver replicas never become ready

Anything else we need to know:
I noticed this while experimenting with https://github.com/observatorium/thanos-receive-controller, which automatically changes the hashring, but users could also hit this issue with erroneous hashring config files.


stale · 2022-11-13T15:10:26Z

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

JayChanggithub · 2023-02-24T14:25:51Z

Hi @matej-g
I seem to be hitting the same symptoms as you. Did you get this resolved? Could you share your receiver and hashring parameters?

MichaHoffmann · 2023-02-27T16:24:33Z

Created #6168 for it now.

dmilind · 2024-01-12T23:47:19Z

Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

MichaHoffmann · 2024-01-13T09:20:24Z

> Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

Are you experiencing deadlock or does receiver fail to start with an error?

matej-g added bug component: receive labels Aug 24, 2022

matej-g self-assigned this Aug 24, 2022

stale bot added the stale label Nov 13, 2022

stale bot removed the stale label Feb 24, 2023

MichaHoffmann mentioned this issue Feb 27, 2023

receive: fail early if ketama hashring is configured with number of nodes lower than the replication factor #6168

Merged
