
Query does not log undiscoverable store #2404

Closed
belm0 opened this issue Apr 9, 2020 · 14 comments

@belm0
Contributor

belm0 commented Apr 9, 2020

Thanos, Prometheus and Golang version used: v0.11.0
Object Storage Provider: GCS

What happened:
Specified the Store and Sidecar storage endpoints on the command line using the DNS resolver. The Sidecar endpoint was incorrect. Query logged the addition of the Store endpoint, but nothing about the Sidecar endpoint. The Query /stores page shows only the Store endpoint.

        --store.sd-dns-resolver=miekgdns
        --store=dnssrv+_grpc._tcp.thanos-store-grpc.default.svc.cluster.local
        --store=dnssrv+_grpc._tcp.thanos-sidecar-grpc.default.svc.cluster.local

What you expected to happen:
Query logs an error regarding DNS resolution of the bad endpoint. The /stores page provides information about the unresolved endpoint.

@bwplotka
Member

bwplotka commented Apr 9, 2020

Agree, good point. 👍 Marking as bug, help wanted to fix it 🤗

@yashrsharma44
Contributor

Shall I have a go at solving this issue?

@bwplotka
Member

bwplotka commented Apr 20, 2020 via email

@yashrsharma44
Contributor

Great! I will hop on! 😛

@yashrsharma44
Contributor

Hey @bwplotka! I was going through the issue, and after reproducing it, I observed the following -

  • The DNS resolution raises an error (here); it seems to log the error and move on to the next DNS resolution.
    I was thinking of passing that error up to this function so that, if an error is detected, it is logged from the Query component (a rough sketch is below).
    What do you think about the approach?
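
Roughly, I am imagining something like this (only a sketch with made-up names, not the actual Thanos code): the provider collects the per-address lookup failures and hands them back, so the Query component can log them with its own logger and expose the broken endpoints, e.g. on the /stores page.

package dns

import (
	"context"
	"fmt"
)

// Resolver stands in for the SRV/A lookup backend (miekgdns or golang).
type Resolver interface {
	Resolve(ctx context.Context, name string, qtype string) ([]string, error)
}

// Provider keeps the last successfully resolved addresses per configured --store flag.
type Provider struct {
	resolver Resolver
	resolved map[string][]string
}

// Resolve updates the resolved addresses and, instead of only logging,
// returns the lookup failures so the caller can surface them.
func (p *Provider) Resolve(ctx context.Context, addrs map[string]string) []error {
	var errs []error
	for name, qtype := range addrs {
		res, err := p.resolver.Resolve(ctx, name, qtype)
		if err != nil {
			// Keep the previously resolved records, but report the failure upwards.
			errs = append(errs, fmt.Errorf("dns resolution of %q failed: %w", name, err))
			continue
		}
		p.resolved[name] = res
	}
	return errs
}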

@bwplotka
Member

Nice, so if it already logs an error... what is the problem we are trying to solve then? (:

@yashrsharma44
Contributor

The error was raised from the resolver.go file, which does the DNS resolution, but somehow that error is not propagated to the query component. So I think we need to propagate the error, as I didn't see any errors raised in the logs of the query component.

@bwplotka
Member

What do you mean it's not propagated? There is literally a level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err) log line 🤔

@yashrsharma44
Contributor

yashrsharma44 commented Apr 25, 2020

Yeah, when I ran the query component on my local machine, it did raise the error, but the same does not happen when I check the logs of the query component in the Kubernetes deployment.

Let me attach the log.

@bwplotka
Member

Nice! Maybe the logger is not passed properly?

@yashrsharma44
Contributor

I am attaching some info about the investigation that I did 😛

Config details passed to thanos query

thanos query \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:9090 \
--query.replica-label=prometheus_replica \
--query.replica-label=rule_replica \
--store.sd-dns-resolver=miekgdns \
--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local \
--store=dnssrv+_grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local

Here is the log -


level=info ts=2020-04-25T09:31:55.41307044Z caller=main.go:152 msg="Tracing will be disabled"
level=info ts=2020-04-25T09:31:55.457862986Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-04-25T09:31:55.458943584Z caller=query.go:401 msg="starting query node"
level=info ts=2020-04-25T09:31:55.45948314Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-04-25T09:31:55.460033534Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2020-04-25T09:31:55.460226236Z caller=http.go:56 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2020-04-25T09:31:55.460227986Z caller=grpc.go:106 service=gRPC/server component=query msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2020-04-25T09:33:25.58191222Z caller=storeset.go:384 component=storeset msg="adding new storeAPI to query storeset" address=172.17.0.13:10901 extLset=

And here are the pods that I have deployed in minikube -

 yash@kmaster  kube-prome/prome-thanos  sudo kubectl get po --all-namespaces
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            coredns-66bff467f8-8zw8l                     1/1     Running   14         18h
kube-system            coredns-66bff467f8-sh9v9                     1/1     Running   10         18h
kube-system            etcd-kmaster                                 1/1     Running   5          18h
kube-system            kube-apiserver-kmaster                       1/1     Running   6          18h
kube-system            kube-controller-manager-kmaster              1/1     Running   3          7h26m
kube-system            kube-proxy-8zcsp                             1/1     Running   2          18h
kube-system            kube-scheduler-kmaster                       1/1     Running   4          7h26m
kube-system            storage-provisioner                          1/1     Running   3          18h
kubernetes-dashboard   dashboard-metrics-scraper-84bfdf55ff-8268b   1/1     Running   2          18h
kubernetes-dashboard   kubernetes-dashboard-bc446cc64-cqdsn         1/1     Running   7          18h
monitoring             alertmanager-5f7f948969-jvgbb                1/1     Running   1          7h12m
monitoring             minio-2-7d5765f59c-56299                     1/1     Running   1          7h13m
monitoring             prometheus-0                                 2/2     Running   3          7h11m
monitoring             prometheus-1                                 2/2     Running   3          7h8m
monitoring             prometheus-2                                 2/2     Running   3          7h6m
thanos                 minio-85fd55b9fd-6t2wp                       1/1     Running   0          50s
thanos                 thanos-query-77d797f89d-vj4v8                1/1     Running   0          33s
thanos                 thanos-store-0                               0/1     Running   1          30s

As we can see, prometheus-3-service is not present, yet the query somehow skips logging the error. (A quick standalone check of that SRV record is sketched below.)
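
As a sanity check, a standalone lookup of the same SRV record (only a sketch; dnssrv+ means an SRV query, and the record name follows the config above) should fail the same way when run inside the cluster:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Same record the query was configured with:
	// _grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local
	_, srvs, err := net.DefaultResolver.LookupSRV(ctx, "grpc", "tcp",
		"prometheus-3-service.monitoring.svc.cluster.local")
	if err != nil {
		// Expected here, since the service does not exist.
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d\n", s.Target, s.Port)
	}
}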

@yashrsharma44
Contributor

Maybe logger is not passed properly?

I think that might be the reason. I am reading through the codebase now and will comment with my understanding of the possible issue 😅
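
For reference, a tiny illustration of that suspicion (assuming the go-kit logger that Thanos uses; this is not the actual wiring): if a component ends up with a no-op logger instead of the configured one, its error lines simply disappear.

package main

import (
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

func main() {
	configured := log.NewLogfmtLogger(os.Stderr)
	nop := log.NewNopLogger()

	// Shows up in the component's logs.
	level.Error(configured).Log("msg", "dns resolution failed",
		"addr", "dnssrv+_grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local",
		"err", "no such host")

	// Silently dropped - what we would see if the resolver was handed a no-op logger.
	level.Error(nop).Log("msg", "dns resolution failed",
		"addr", "dnssrv+_grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local",
		"err", "no such host")
}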

@stale

stale bot commented May 25, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label May 25, 2020
@stale

stale bot commented Jun 1, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 1, 2020