Automatically watch for errors in our infrastructure logs #2693

Closed · yuvipanda opened this issue Sep 3, 2021 · 33 comments
Labels: automation (Manual things that shouldn't be)

@yuvipanda

https://cloud.google.com/error-reporting can help us watch for exceptions in various parts of our infrastructure (the hub, notebook user servers, etc.). This way we can stop relying 100% on user reports to notice errors.

@yuvipanda

It's already set up actually! https://console.cloud.google.com/errors

@yuvipanda

I don't see the 503s from hub pods there though.

balajialg added the automation label on Sep 3, 2021

balajialg commented Sep 7, 2021

@yuvipanda This is an awesome improvement! I have the following questions/suggestions,

  • Is there a way to configure this reporting to provide custom messages about which hub is facing the issue? I would like us to get to a stage where anyone on our team can monitor these reports reliably and reach out to the teaching team when there is an outage in a specific service/hub.
  • Can we set up a daily monitoring report from this console that highlights the health of the various hubs/services and reports outages if there are any?
  • I see information like "Readiness probe failed: Get "http://10.56.5.24:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)". I suppose this is not a blocker, as we didn't get any issues from our users. However, how do we differentiate these error messages from messages for blockers like 404, 403, etc.? I would appreciate it if you could help me parse the difference!

@balajialg

#2601

@yuvipanda

Yeah, I think a lot of the errors here are false positives, and it's not catching most actual errors. A deep dive is needed to understand this product, see if it's actually helpful to us, and figure out how we can use it. I had never seen that page before opening this issue, so I don't know much about it yet.

@felder do you think this is something you might be able to tackle?

felder self-assigned this on Sep 9, 2021

felder commented Sep 9, 2021

@yuvipanda yeah I can take a crack at it


felder commented Sep 20, 2021

@yuvipanda Glancing at the documentation for the Google Error Reporting service, it appears that to get application-level errors we would need code changes to send the data to the Error Reporting service or to call its API.

https://cloud.google.com/error-reporting/docs/setup/kubernetes-engine
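
For reference, a minimal sketch of what such a code change might look like, assuming the google-cloud-error-reporting Python client and ambient GCP credentials (the service name and the failing function below are purely illustrative):

from google.cloud import error_reporting

client = error_reporting.Client(service="jupyterhub-hub")  # illustrative service name

def risky_operation():
    # Stand-in for real hub/application code that can fail.
    raise RuntimeError("example failure")

try:
    risky_operation()
except Exception:
    # Sends the current exception and its stack trace to Error Reporting.
    client.report_exception()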


felder commented Sep 21, 2021

This may prove more fruitful:
https://console.cloud.google.com/logs/query?project=ucb-datahub-2018


felder commented Sep 21, 2021

Here's a query that looks for 503s on 9/16, when we last had an issue (#2677):

https://cloudlogging.app.goo.gl/yHks4Rcc8FgtzAYu5

Note the peak between 4:30pm and 5:00pm PDT

However, there is a lot of noise in here: for example, each pod gets a 503 for metrics, and requests resulting in a 200 are also logged as errors. If we could get the logging to a point where actual errors are reported as type error and the noise isn't, that would help a lot with isolating errors to alert on.
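
As a sketch of where this could go (an assumption, not something already set up), the same kind of filter can be run programmatically with the google-cloud-logging Python client, excluding the metrics noise up front; the release name and filter strings below are illustrative:

from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

# Illustrative filter: 503s from single-user servers, with /metrics noise excluded.
log_filter = (
    'labels."k8s-pod/component"="singleuser-server" '
    'AND labels."k8s-pod/release"="data100-prod" '
    'AND textPayload=~" 503 " '
    'AND textPayload!~"metrics"'
)

for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)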


felder commented Sep 21, 2021

Looks like I can define log-based metrics from queries and then set up alerts based on those metrics.
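
A rough sketch of that idea (an assumption on my part, not something already configured): the google-cloud-logging client can create a log-based metric from a filter, and a Cloud Monitoring alerting policy can then watch that metric. The metric name and filter below are illustrative:

from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

# Illustrative log-based metric counting 503s from single-user servers.
metric = client.metric(
    "singleuser-server-503s",
    filter_='labels."k8s-pod/component"="singleuser-server" AND textPayload=~" 503 " AND textPayload!~"metrics"',
    description="503 responses logged by single-user notebook servers",
)

if not metric.exists():
    metric.create()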


felder commented Sep 21, 2021

The following Logs Explorer query seems to be pointing in a promising direction. It shows a large spike for data100-prod during the incident on 9/16.

labels."k8s-pod/release"="data100-prod"
textPayload=~"\[E "
textPayload!~"metrics"
labels."k8s-pod/component"="singleuser-server"

[screenshot: Logs Explorer results for data100-prod]


felder commented Sep 21, 2021

@balajialg requested I do a similar log query for prob140-prod around Sept. 13.

[screenshot: Logs Explorer results for prob140-prod]


felder commented Sep 21, 2021

Similar query again for r-prod for September

[screenshot: Logs Explorer results for r-prod]


felder commented Sep 21, 2021

Here's publichealth-prod for the last week. The first spike matches up with a report on Slack of students having trouble, and similarly there is a spike from last night, when another issue was reported.

[screenshot: Logs Explorer results for publichealth-prod]


balajialg commented Sep 21, 2021

This is brilliant work, @felder! Thanks for spending the time to figure things out.

@yuvipanda

@felder for alerting, can you explore https://grafana.com/docs/grafana/latest/alerting/ instead? Since all our metrics are already present in grafana...

@yuvipanda

oooh, github just showed me your other messages. We definitely don't have metrics on 503s from user servers - we tried putting them in Prometheus (#1993 and #1973) but Prometheus was crushed under the weight of those (#1977). So let's explore that and see how it goes.

Thanks for putting time into this, @felder

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Sep 23, 2021
This gets metrics about requests and response codes
from nginx into prometheus, so we can look for 5xx and
4xx errors from it.

Note that datahub.berkeley.edu does *not* go through this,
but everything else. berkeley-dsep-infra#2167
tracks that.

Ref berkeley-dsep-infra#2693
@yuvipanda

With #2792, we're now collecting Prometheus metrics on each request that goes through the nginx ingress. https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ has some more info. #2693 means that datahub is not represented here yet.

https://grafana.datahub.berkeley.edu/d/IxT315H7z/http-requests-nginx-copy?orgId=1 is a trial dashboard I started playing with that has some of these metrics. I haven't really explored the data. Curious how you think this compares to the Google Cloud dashboard, @felder? Think you have the time to poke around this too?
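
For anyone poking at this later, a rough sketch (my assumption, not something already on the dashboard) of the kind of PromQL the ingress-nginx metrics enable, run here against the Prometheus HTTP API with requests; the Prometheus URL is illustrative:

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # illustrative endpoint

# Per-ingress rate of 5xx responses over the last 5 minutes, based on the
# nginx_ingress_controller_requests counter exposed by ingress-nginx.
QUERY = 'sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    print(result["metric"].get("ingress"), result["value"][1])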


balajialg commented Sep 24, 2021

@yuvipanda I was looking at this dashboard today. We had 100+ 5xx errors consistently over the past 6 hours (9/24). However, no 5xx issues were raised by our users, which got me wondering whether Prometheus is just spitting out error logs?

[screenshot: Grafana 5xx error dashboard]

Or is the assumption that a certain percentage of our users will always hit this kind of error, but it never gets reported to us because it affects only a small percentage of users (200/4000, ~5%)? I also realized that we may not be able to verify the error logs for previous outages (#2791) the way Jonathan did for GCP, as this data is not available as part of the Grafana error dashboard.


felder commented Sep 28, 2021

@yuvipanda yeah, I can take a look at this, but it'll be after my vacation.

@yuvipanda

There are a lot of spurious errors like this:

[W 2021-10-04 18:56:50.738 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/USERNAME/api/metrics/v1
[E 2021-10-04 18:56:50.740 JupyterHub log:189] 503 GET /hub/user/USERNAME/api/metrics/v1 (sanjyot_bakshi@10.0.0.19) 5.68ms

This is when their servers have been culled but their browser tab is still open. I think this should be a 404 instead... We should clean that up.

@yuvipanda

Once we get jupyterhub/jupyterhub#3636 deployed, our 503s will have a much higher signal-to-noise ratio. We can use the logs to quickly investigate where they might be coming from and fix them as we find them.

@yuvipanda

We can also access data from Stackdriver in our Grafana - https://grafana.com/grafana/plugins/stackdriver/. That might be helpful?


balajialg commented Oct 6, 2021

@yuvipanda Will Stackdriver push the data that @felder analyzed via the GCP console? CC'ing @felder so that he can look into this issue and add his thoughts about the value of this data when he is back.

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 6, 2021
Brings in jupyterhub/jupyterhub#3639
without bringing in all of master.

Should eventually help us clean up the spurious 503s.

Ref berkeley-dsep-infra#2693
yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 6, 2021
Brings in jupyterlab/jupyterlab#11205,
to support us bringing in jupyterhub/jupyterhub#3636
so we can have cleaner 503 error graphs.

Ref berkeley-dsep-infra#2693 (comment)
@yuvipanda

[screenshot: Grafana graph of 503 errors dropping to zero]

Now our background level of 503s has dropped to 0, and we can treat any spike as something to investigate!

Note that the main datahub is still not going through this proxy - see #2693. I'll try to fix that up soon.

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 18, 2021
Bumps us from 7.x to 8.x, for newer features and (hopefully)
better support for pagerduty - I couldn't get the pagerduty
alert notification to work with v2 API keys.

Ref berkeley-dsep-infra#2693
@yuvipanda

So Grafana supports defining alerts (https://grafana.com/docs/grafana/latest/alerting/) and supports many different notification channels. I've signed us up for PagerDuty and invited @felder and @balajialg. We can define alerts on each graph - I've defined a test one for pods that aren't scheduled (https://grafana.datahub.berkeley.edu/d/R3O4mbg7z/cluster-information?orgId=1&viewPanel=7), and it seems to work!

We can pipe these alerts to slack as well.


balajialg commented Oct 18, 2021

@yuvipanda, this is an awesome update!

Correct me if I've understood the workflow right! For example: we can configure Slack (or any other real-time) notifications whenever 500 errors increase beyond a threshold that is not considered usual. Is this an accurate understanding of one of the use cases?

@yuvipanda

@balajialg yeah that's correct.

@balajialg

@yuvipanda Considering the 5xx dashboard, is this issue done? Anything else to be done pertaining to the scope of the issue?


balajialg commented Dec 2, 2021

@yuvipanda Considering that we are tracking PagerDuty-related alerts through #2988, I am closing this issue. Please feel free to reopen it if you think this is separate/incomplete/requires further attention.

shaneknapp pushed commits referencing this issue to berkeley-dsep-infra/biology-user-image (Sep 6, 2024), berkeley-dsep-infra/publichealth-user-image (Sep 11, 2024), berkeley-dsep-infra/eecs-user-image (Sep 11, 2024), berkeley-dsep-infra/julia-user-image (Sep 12, 2024), berkeley-dsep-infra/ischool-user-image (Sep 24, 2024), and berkeley-dsep-infra/datahub-user-image (Sep 25, 2024), each with the same message:

Brings in jupyterlab/jupyterlab#11205,
to support us bringing in jupyterhub/jupyterhub#3636
so we can have cleaner 503 error graphs.

Ref berkeley-dsep-infra/datahub#2693 (comment)