Query: upstream query endpoint reclassified as sidecar unexpectedly #5278

Open
erhudy opened this issue Apr 13, 2022 · 2 comments

erhudy commented Apr 13, 2022

Thanos, Prometheus and Golang version used: Thanos v0.25.0, Prometheus v2.34.0

Object Storage Provider: N/A

What happened: I have a hybrid Thanos setup that joins Thanos Query instances in AWS with on-prem Thanos Query instances. Our Grafana setup reads from on-premises Thanos Query, which in turn reads from a combination of on-prem sidecar instances and cloud query instances.

In the setup that is causing problems, our non-production Thanos Query instances read from two cloud query instances. One of them is for a QA environment and is only announcing 2 labelsets at present. The other is for a dev environment and is announcing 24 labelsets. Each cloud query instance consists of 4 Thanos Query replicas running in EKS, fronted by an NLB provisioned by the AWS LB controller. The NLB forwards 10901/TCP through to the replicas.

What you expected to happen:

What happens regularly is that the cloud dev query instance is unexpectedly reclassified as a sidecar announcing a single labelset. The two on-prem query instances that read from the cloud one do not always agree on this; sometimes one of them shows it as a sidecar while the other shows it as a query. Occasionally it fixes itself and goes back to being classified as a query endpoint, but more often it stays stuck that way and I have to restart the on-prem instances to get them to see it as a query endpoint again.

How to reproduce it (as minimally and precisely as possible):

I have not worked out what triggers this yet. My initial suspicion was that the Thanos instances in EKS were being restarted in quick succession and the NLB would stay unhealthy for too long while new target registration was in progress, so I did some work to reduce how often Thanos was getting restarted in EKS, but that doesn't seem to have made a difference so far.

Full logs to relevant components:

I don't have logs at the moment but will post them the next time I see the problem occur.
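In the meantime, one way to catch the moment of reclassification might be to poll the on-prem Query instance and log any change in how an endpoint is classified. The following is a minimal sketch, not a tested tool. It assumes the Thanos Query HTTP API serves `/api/v1/stores` with a `data` object keyed by component type (`query`, `sidecar`, ...) whose entries carry `name` and `labelSets` fields; the helper names (`classify`, `diff`, `poll`) and the `base_url` argument are hypothetical.

```python
import json
import time
import urllib.request


def classify(stores_payload):
    """Map endpoint name -> (component type, number of labelsets).

    Assumes the payload shape {"data": {"<component>": [{"name": ..., "labelSets": [...]}]}}.
    """
    result = {}
    for component, endpoints in stores_payload.get("data", {}).items():
        for endpoint in endpoints:
            labelsets = endpoint.get("labelSets") or []
            result[endpoint["name"]] = (component, len(labelsets))
    return result


def diff(before, after):
    """Return (name, old, new) for endpoints present in both snapshots that changed."""
    changes = []
    for name in sorted(before.keys() & after.keys()):
        if before[name] != after[name]:
            changes.append((name, before[name], after[name]))
    return changes


def poll(base_url, interval_s=30):
    """Poll base_url (hypothetical, e.g. the on-prem Query HTTP address) and print changes."""
    previous = None
    while True:
        with urllib.request.urlopen(f"{base_url}/api/v1/stores", timeout=10) as resp:
            current = classify(json.load(resp))
        if previous is not None:
            for name, old, new in diff(previous, current):
                stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
                print(f"{stamp} {name}: {old} -> {new}")
        previous = current
        time.sleep(interval_s)
```

Running `poll(...)` against each of the two on-prem Query instances separately would also show whether they disagree about the same upstream at the same time, and the timestamps could then be lined up against restarts in EKS and NLB target health.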

Anything else we need to know:

@wiardvanrij
Member

Hi, thanks for the intro. It would be nice to have more information, and perhaps some examples, logs, or screenshots when you get the chance.

stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 21, 2022