Ruler: v0.25.2 no query API server unreachable #5321

Open
bwplotka opened this issue May 2, 2022 · 16 comments

Comments

@bwplotka
Member

bwplotka commented May 2, 2022

One user shared that our Rulers were having hiccups finding the right Querier endpoints, resulting in gaps:

image

Apparently reverting to v0.24.0 resolved the issue. This seems to be a stateful Ruler.

We will need more information, e.g.:

  • What was reverted: only the Ruler version, or anything else?
  • What is the configuration of the mentioned Ruler?
@sharathfeb12

The configuration of Thanos Ruler:

- args:
  - rule
  - --log.level=debug
  - --log.format=logfmt
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --objstore.config=$(OBJSTORE_CONFIG)
  - --data-dir=/thanos/data
  - --eval-interval=2m
  - --label=rule_replica="$(NAME)"
  - --alert.label-drop=rule_replica
  - --remote-write.config-file=/etc/thanos/conf/rw-config.yaml
  - --query=dnssrv+_http._tcp.observatorium-thanos-query-frontend.monitoring.svc.cluster.local
  - --rule-file=/etc/thanos/rules/*/*.yaml

There was no change in the config. Only the version change from v0.25.2 to v0.24.0 fixed the problem.

@stale

stale bot commented Jul 31, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 31, 2022
@RohitKochhar
Contributor

I have been encountering a similar issue since upgrading to v0.28.1. Many rules fail to be evaluated by the Ruler with the error "no query API server reachable". Was this issue ever resolved? @bwplotka @yeya24

@stale stale bot removed the stale label Jan 18, 2023
@daganibhanu

I'm seeing the same issue after upgrading to v0.29.0, and I have a couple of findings: with around 4k+ targets it works fine, whereas when the targets were increased to 24k we run into the error "No query API server reachable".

Additional info from the logs:

LabelSets: Mint: -62167219200000 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}

Also, can someone help me understand whether all rules are executed simultaneously?

@daganibhanu

Hi team, 5903: as per the suggestion, we upgraded to 0.29.0, and since then we have been seeing this issue. Is there any workaround, or could you please help us deal with it?
Thanks in advance!

@daganibhanu

@bwplotka Is this issue addressed in version 0.30.0? Any pointers on this issue would be helpful.
Thanks in advance!!

@Cellebyte

@bwplotka we have the same problem with the 0.30.0 Ruler.
We deploy it with the ThanosRuler CRD and use dnssrv record discovery in Kubernetes.

@Cellebyte

Cellebyte commented Mar 3, 2023

@bwplotka it looks like partial_response_strategy now needs to be set for Ruler rules.
Without it, querying with missing stores is not possible, because the query returns errors.

@Migueljfs

Hey @Cellebyte, I'm having the same issue. Could you clarify how you fixed it?

As per Thanos documentation: "It is recommended to keep partial response as abort for alerts and that is the default as well."

What exactly did you enable, and how? I'm using the ThanosRuler CRD, if that helps.

@Cellebyte

@Migueljfs you need to set partial_response_strategy: "warn", because the Ruler will fail if one of the StoreAPIs behind your Querier is not reachable or does not answer the Ruler's rule request.
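
For illustration, a minimal sketch of where this setting goes in a Thanos rule file. The group name, alert name, and expression are hypothetical placeholders, not taken from anyone's setup in this thread; only the partial_response_strategy field and its "warn" value come from the comment above.

```yaml
# Hypothetical rule file for Thanos Ruler.
groups:
  - name: example-group                # placeholder group name
    # Set per rule group; "abort" is the documented default.
    partial_response_strategy: "warn"
    rules:
      - alert: ExampleTargetDown       # placeholder alert
        expr: up == 0
        for: 5m
        labels:
          severity: warning
```

Note that with "warn" a rule may be evaluated on incomplete data, which is why the Thanos documentation quoted above recommends "abort" for alerts.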

@Cellebyte

We cover the problem mentioned above with an additional alert that checks whether our remote query is reachable, using vector(0) or the up metric for the remote cluster.
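
A rough sketch of what such a guard alert could look like. The cluster label name and value, the alert name, and the duration are all assumptions, and the alert actually used by the commenter may differ.

```yaml
# Hypothetical guard rule; label names and values are assumptions.
groups:
  - name: remote-query-guard
    partial_response_strategy: "warn"
    rules:
      - alert: RemoteClusterDataMissing
        # count() returns an empty result when no up series exist for the
        # remote cluster; "or vector(0)" substitutes 0 in that case so the
        # comparison can match and the alert can fire.
        expr: (count(up{cluster="remote-cluster"}) or vector(0)) == 0
        for: 10m
        labels:
          severity: critical
```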

@daganibhanu

We have identified the issue. In our case, the problem was with one of the Prometheus shards, which had used up all its memory and was not responding. After cleaning up its data (removing the WAL, head_chunks, and TSDB, which may cause data loss) and bringing the shards up clean, it started working.

@sunilnerella

Did anyone get a fix for the above issue? I have set partial_response_strategy: "warn" in my rules file, but I still get the same error, "no query API server reachable".
Below is the command I used to bring up my Ruler.
/bin/thanos rule --data-dir /var/lib/prometheus-ruler/ --eval-interval 30s --rule-file /etc/prometheus/alert/*.yml --alert.query-url http:/<prom-server-1>:9090 --alertmanagers.url http://localhost:9093 --objstore.config-file /etc/prometheus/bucket.yml --query http://<prom-server-1>:129090 --query http://<prom-server-2>:29090 --label 'monitor_cluster="eu1"' --label 'replica="prom-server101"'

Can someone help with this issue? Or is there another version of Thanos that handles this error?

@zbialik

zbialik commented Nov 21, 2023

Having a similar issue running Thanos v0.31.0 via the ThanosRuler CRD (Prometheus Operator).

@LukaszWasko

LukaszWasko commented Jan 29, 2024

I changed the --query value from a load balancer (with Thanos Query instances as endpoints) to the direct Thanos Query endpoint names. The problem disappeared immediately :)
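
As a sketch of that change, in the same args style as the Ruler configuration posted earlier in the thread. The host names are hypothetical, port 10902 is assumed to be the Thanos Query HTTP port, and --query is given once per Querier.

```yaml
- args:
  - rule
  # Before (hypothetical): one load balancer address in front of the Queriers
  # - --query=thanos-query-lb.example.internal:10902
  # After: each Thanos Query instance listed directly
  - --query=thanos-query-0.example.internal:10902
  - --query=thanos-query-1.example.internal:10902
```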

@lilic
Contributor

lilic commented Sep 10, 2024

@bwplotka hey 👋 I ran into this issue today as well. I can resolve the Thanos Query address normally from within the Thanos Ruler container. I am also using Thanos v0.29 via Prometheus Operator, and the configuration seems to be passed to Thanos correctly. Any clues or hints on what might be the issue? Thanks!

10 participants