Receiver: Running with number of replicas lower than replication factor in a hashring is accepted #5639

Open
matej-g opened this issue Aug 24, 2022 · 5 comments

Comments

@matej-g
Collaborator

matej-g commented Aug 24, 2022

Thanos, Prometheus and Golang version used:
0.28.0-rc.0

What happened:
I was trying out the new RC locally with --receive.hashrings-algorithm=ketama, 6 replicas with replication factor 3. During my tests, some of my replicas were never able to get into a ready state.

After more digging I found out it occurs when my setup got into a state where the number of endpoints in the hashring was lower than the replication factor. I think there is a twofold problem here, depending on which hashing algorithm is used:

  • In my case I used --receive.hashrings-algorithm=ketama. This caused the hashring creation logic to get into an infinite loop, since we're not able to pre-calculate replicas for sections. This causes the hashring change channel to block forever, so the receiver never obtains the hashring configuration. Although the receiver is running, its storage is never initialized and it will never become ready
  • In case of using the default hashmod algorithm, this issue might not be so obvious, since we're not doing such pre-calculation. However, it would still mean that some replication requests land on the same nodes, which is not desired behavior
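
To make the two cases above concrete, here is a minimal sketch in Python. It is a simplified model and not the Thanos implementation: the function names, the endpoint names, and the exact replica-selection rules are assumptions made for illustration. The hashmod model picks replica n at index (hash + n) modulo the number of endpoints; the ring walk collects distinct endpoints and is deliberately bounded to one lap, because an unbounded version would spin forever when there are fewer distinct endpoints than the replication factor.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def hashmod_replicas(endpoints, series_key, replication_factor):
    """Simplified hashmod model (assumption): replica n -> (hash + n) % len(endpoints)."""
    h = stable_hash(series_key)
    return [endpoints[(h + n) % len(endpoints)] for n in range(replication_factor)]


def distinct_replicas_bounded(ring, start, replication_factor):
    """Walk a ring of section owners from `start`, collecting distinct endpoints.

    A loop of the form `while len(found) < replication_factor` would never
    exit if the ring holds fewer distinct endpoints than the replication
    factor. This version gives up after one full lap around the ring.
    """
    found = []
    for step in range(len(ring)):
        owner = ring[(start + step) % len(ring)]
        if owner not in found:
            found.append(owner)
        if len(found) == replication_factor:
            break
    return found


endpoints = ["receive-0", "receive-1"]  # hypothetical endpoint names
ring = ["receive-0", "receive-1", "receive-0", "receive-1"]  # hypothetical sections

# With 2 endpoints and replication factor 3, replicas 0 and 2 share a node.
print(hashmod_replicas(endpoints, "series-a", 3))
# Only 2 distinct owners can ever be found, so 3 replicas are unreachable.
print(distinct_replicas_bounded(ring, 0, 3))
```

In this model the hashmod case degrades silently (duplicate targets), while the ring-walk case can only finish because of the one-lap bound.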

What you expected to happen:
I'd expect the receiver not to hang forever (in the case of the Ketama algorithm) and to handle a configuration where the replication factor cannot be guaranteed (e.g. log an error, exit the receiver).
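
As a rough illustration of that expectation, a validation step could reject such a configuration before any hashring is built. The sketch below is hypothetical: `validate_hashring` is not a Thanos function, and whether to log, return an error, or exit is left to the caller.

```python
def validate_hashring(endpoints, replication_factor):
    """Raise ValueError if the replication factor cannot be guaranteed.

    Hypothetical helper for illustration only. Duplicate endpoint entries
    are counted once, since they cannot hold separate replicas.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    distinct = set(endpoints)
    if len(distinct) < replication_factor:
        raise ValueError(
            f"hashring has {len(distinct)} distinct endpoint(s), "
            f"but replication factor is {replication_factor}"
        )


try:
    validate_hashring(["receive-0", "receive-1"], 3)  # hypothetical endpoint names
except ValueError as err:
    print(f"rejecting hashring config: {err}")
```

Failing fast like this would turn the silent hang into an explicit error at configuration load time.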

How to reproduce it (as minimally and precisely as possible):

  • Create hashring configuration with 2 endpoints
  • Create a receiver setup with --receive.hashrings-algorithm=ketama and --receive.replication-factor=3
  • Watch the receiver replicas never become ready
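
For the first step, a two-endpoint hashring file could be generated along these lines. This is a sketch assuming the JSON layout used for Thanos Receive hashring files (a list of objects with `hashring` and `endpoints` keys); the helper name, the hashring name, and the addresses are made up for illustration.

```python
import json


def build_hashring_config(endpoints, name="default"):
    """Return hashring JSON text for a single hashring (hypothetical helper)."""
    return json.dumps([{"hashring": name, "endpoints": list(endpoints)}], indent=2)


# Two endpoints only, which is fewer than the replication factor of 3
# used in the reproduction steps. The addresses are placeholders.
config_text = build_hashring_config(["127.0.0.1:10907", "127.0.0.1:10908"])
print(config_text)
```

The resulting text would be saved to a file and passed to each receiver, for example via --receive.hashrings-file, alongside the flags listed above.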

Anything else we need to know:
I noticed this when I was experimenting with https://github.com/observatorium/thanos-receive-controller, which automatically changes the hashring, but users could hit this issue even with erroneous hashring config files.

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 13, 2022
@JayChanggithub

Hi @matej-g
It seems I'm hitting the same symptoms as you. Did you get this resolved? Could you share your receiver and hashing parameters?

@stale stale bot removed the stale label Feb 24, 2023
@MichaHoffmann
Contributor

Created #6168 for it now.

@dmilind

dmilind commented Jan 12, 2024

Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

@MichaHoffmann
Contributor

> Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

Are you experiencing a deadlock, or does the receiver fail to start with an error?
