Thanos, Prometheus and Golang version used: 0.28.0-rc.0
What happened:
I was trying out the new RC locally with --receive.hashrings-algorithm=ketama, running 6 replicas with a replication factor of 3. During my tests, some of my replicas were never able to reach a ready state.
After more digging I found out it occurs when my setup gets into a state where the number of endpoints in the hashring is lower than the replication factor. I think there is a twofold problem here, depending on which hashing algorithm is used:
In my case I used --receive.hashrings-algorithm=ketama. This caused the hashring creation logic to get into an infinite loop, since it is not possible to pre-calculate replicas for the ring sections. The hashring change channel then blocks forever and the hashring configuration is never obtained, meaning that although the receiver process is running, it never initializes its storage and never becomes ready (sketched below)
When using the default hashmod algorithm, the issue is less obvious, since no such pre-calculation is done. However, it still means some replication requests land on the same nodes, so fewer distinct copies exist than the replication factor promises, which is not the desired behavior (also sketched below)
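To illustrate the ketama case, here is a minimal Go sketch of the failure mode. The names and structure are illustrative, not the actual Thanos code; it only shows why a "collect N distinct endpoints per section" loop cannot terminate when fewer than N endpoints exist:

```go
package main

import "fmt"

// contains reports whether s is already in the replica list.
func contains(replicas []string, s string) bool {
	for _, r := range replicas {
		if r == s {
			return true
		}
	}
	return false
}

// assignReplicas mimics the kind of replica pre-calculation a ketama-style
// hashring does per ring section: keep walking the ring until
// replicationFactor *distinct* endpoints are found. With fewer endpoints
// than the replication factor, that condition can never be satisfied and
// the inner loop spins forever.
func assignReplicas(endpoints []string, replicationFactor int) [][]string {
	sections := make([][]string, len(endpoints))
	for i := range sections {
		var replicas []string
		for j := 0; len(replicas) < replicationFactor; j++ {
			candidate := endpoints[(i+j)%len(endpoints)]
			if !contains(replicas, candidate) {
				replicas = append(replicas, candidate)
			}
		}
		sections[i] = replicas
	}
	return sections
}

func main() {
	// 2 endpoints, replication factor 3: this call never returns.
	fmt.Println(assignReplicas([]string{"10.0.0.1:10901", "10.0.0.2:10901"}, 3))
}
```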
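To illustrate the hashmod case, here is a toy sketch of the modulo arithmetic (again illustrative, not the actual Thanos code): with 2 endpoints and a replication factor of 3, two of the three replicas collide on the same node.

```go
package main

import "fmt"

func main() {
	// Illustrative only: hashmod-style placement assigns replica i of a
	// series to endpoint (hash + i) % len(endpoints).
	endpoints := []string{"node-a", "node-b"}
	replicationFactor := 3
	seriesHash := uint64(7) // arbitrary example hash of a time series

	for i := 0; i < replicationFactor; i++ {
		target := endpoints[(seriesHash+uint64(i))%uint64(len(endpoints))]
		fmt.Printf("replica %d -> %s\n", i, target)
	}
	// Prints:
	//   replica 0 -> node-b
	//   replica 1 -> node-a
	//   replica 2 -> node-b
	// Replicas 0 and 2 land on the same node, so only 2 distinct copies
	// exist despite a replication factor of 3.
}
```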
What you expected to happen:
I'd expect the receiver not to hang forever (in the case of the ketama algorithm) and to reject a configuration where the replication factor cannot be guaranteed (e.g. log an error and exit the receiver).
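A hypothetical up-front guard along these lines would cover both cases; the function name and placement are my assumption, not an existing Thanos API:

```go
package main

import (
	"fmt"
	"log"
)

// validateHashring is a hypothetical guard (not existing Thanos code):
// check the hashring against the replication factor up front and fail
// fast instead of looping forever or silently under-replicating.
func validateHashring(endpoints []string, replicationFactor uint64) error {
	if uint64(len(endpoints)) < replicationFactor {
		return fmt.Errorf(
			"hashring has %d endpoints, fewer than replication factor %d: replication cannot be guaranteed",
			len(endpoints), replicationFactor,
		)
	}
	return nil
}

func main() {
	if err := validateHashring([]string{"10.0.0.1:10901", "10.0.0.2:10901"}, 3); err != nil {
		log.Fatal(err) // exit with a clear error instead of hanging
	}
}
```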
How to reproduce it (as minimally and precisely as possible):
Create a hashring configuration with 2 endpoints
Create a receiver setup with --receive.hashrings-algorithm=ketama and --receive.replication-factor=3 (an example is sketched below)
Watch the receiver replicas never become ready
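Concretely, a hashring file and invocation like the following reproduce it (endpoints and ports are illustrative, and other required receive flags such as --receive.local-endpoint are elided):

```json
[
  {
    "hashring": "default",
    "endpoints": ["127.0.0.1:10907", "127.0.0.1:10908"]
  }
]
```

```
thanos receive \
  --receive.hashrings-file=hashrings.json \
  --receive.hashrings-algorithm=ketama \
  --receive.replication-factor=3
```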
Anything else we need to know:
I noticed this while experimenting with https://github.com/observatorium/thanos-receive-controller, which changes the hashring automatically, but users could hit this issue even with erroneous hand-written hashring config files