Automatically watch for errors in our infrastructure logs #2693

Closed · yuvipanda opened this issue Sep 3, 2021 · 33 comments
Labels: automation (Manual things that shouldn't be)

@yuvipanda

https://cloud.google.com/error-reporting can help us watch for exceptions in various parts of our infrastructure (the hub, notebook user servers, etc.). This way we can stop relying 100% on user reports to notice errors.

@yuvipanda

It's already set up actually! https://console.cloud.google.com/errors

@yuvipanda

I don't see the 503s from hub pods there though.

balajialg added the automation label on Sep 3, 2021

balajialg commented Sep 7, 2021

@yuvipanda This is an awesome improvement! I have the following questions/suggestions,

  • Is there a way to configure this reporting to provide custom messages about which hub is facing the issue? I would like us to get to a stage where anyone on our team can monitor these reports reliably and reach out to the teaching team when there is an outage in a specific service/hub.
  • Can we set up a daily monitoring report from this console that highlights the health of the various hubs/services and reports outages if there are any?
  • I see information like "Readiness probe failed: Get "http://10.56.5.24:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)". I suppose this is not a blocker, as we didn't get any issues from our users. However, how do we differentiate these error messages from messages for blockers like 404, 403, etc.? I would appreciate it if you could help me parse the difference!

@balajialg

#2601

@yuvipanda

Yeah, I think a lot of the errors here are false positives, and it's not catching most actual errors. A deep dive is needed to understand this product, see if it's actually helpful to us, and figure out how we can use it. I had never seen that page before opening this issue, so I don't know much about it yet.

@felder do you think this is something you might be able to tackle?

felder self-assigned this on Sep 9, 2021

felder commented Sep 9, 2021

@yuvipanda yeah I can take a crack at it


felder commented Sep 20, 2021

@yuvipanda Glancing at the documentation for the Google Error Reporting service, it appears that to get application-level errors we would need code changes to send the data to the Error Reporting service or to call its API.

https://cloud.google.com/error-reporting/docs/setup/kubernetes-engine
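
For reference, a minimal sketch of what such a code change might look like, assuming the google-cloud-error-reporting Python client and ambient GCP credentials (the service name and the failing function below are purely illustrative):

from google.cloud import error_reporting

client = error_reporting.Client(service="jupyterhub-hub")  # illustrative service name

def risky_operation():
    # Stand-in for real hub/application code that can fail.
    raise RuntimeError("example failure")

try:
    risky_operation()
except Exception:
    # Sends the current exception and its stack trace to Error Reporting.
    client.report_exception()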


felder commented Sep 21, 2021

This may prove more fruitful:
https://console.cloud.google.com/logs/query?project=ucb-datahub-2018


felder commented Sep 21, 2021

Here's a query that looks for 503s on 9/16, when we last had an issue (#2677):

https://cloudlogging.app.goo.gl/yHks4Rcc8FgtzAYu5

Note the peak between 4:30pm and 5:00pm PDT

However, there is a lot of noise in here: for example, each pod gets a 503 for metrics, and requests resulting in a 200 are also logged as errors. If we could get the logging to a point where actual errors are reported as type error and the noise isn't, that would help a lot with isolating errors to alert on.
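
As a sketch of where this could go (an assumption, not something already set up), the same kind of filter can be run programmatically with the google-cloud-logging Python client, excluding the metrics noise up front; the release name and filter strings below are illustrative:

from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

# Illustrative filter: 503s from single-user servers, with /metrics noise excluded.
log_filter = (
    'labels."k8s-pod/component"="singleuser-server" '
    'AND labels."k8s-pod/release"="data100-prod" '
    'AND textPayload=~" 503 " '
    'AND textPayload!~"metrics"'
)

for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)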


felder commented Sep 21, 2021

Looks like I can define log-based metrics from queries and then set up alerts based on those metrics.
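
A rough sketch of that idea (an assumption on my part, not something already configured): the google-cloud-logging client can create a log-based metric from a filter, and a Cloud Monitoring alerting policy can then watch that metric. The metric name and filter below are illustrative:

from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

# Illustrative log-based metric counting 503s from single-user servers.
metric = client.metric(
    "singleuser-server-503s",
    filter_='labels."k8s-pod/component"="singleuser-server" AND textPayload=~" 503 " AND textPayload!~"metrics"',
    description="503 responses logged by single-user notebook servers",
)

if not metric.exists():
    metric.create()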


felder commented Sep 21, 2021

The following Logs Explorer query seems to be pointing in a promising direction. It shows a large spike for data100-prod during the incident on 9/16.

labels."k8s-pod/release"="data100-prod"
textPayload=~"\[E "
textPayload!~"metrics"
labels."k8s-pod/component"="singleuser-server"

[screenshot: Logs Explorer results for data100-prod]


felder commented Sep 21, 2021

@balajialg requested I do a similar log query for prob140-prod around Sept. 13.

[screenshot: Logs Explorer results for prob140-prod]


felder commented Sep 21, 2021

Similar query again for r-prod for September

[screenshot: Logs Explorer results for r-prod]


felder commented Sep 21, 2021

Here's publichealth-prod for the last week. The first spike matches up with a report on Slack of students having trouble, and similarly there is a spike from last night, when another issue was reported.

[screenshot: Logs Explorer results for publichealth-prod]


balajialg commented Sep 21, 2021

This is brilliant work, @felder! Thanks for spending the time to figure things out.

@yuvipanda

@felder for alerting, can you explore https://grafana.com/docs/grafana/latest/alerting/ instead? Since all our metrics are already present in grafana...

@yuvipanda

oooh, github just showed me your other messages. We definitely don't have metrics on 503s from user servers - we tried putting them in Prometheus (#1993 and #1973) but Prometheus was crushed under the weight of those (#1977). So let's explore that and see how it goes.

Thanks for putting time into this, @felder

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Sep 23, 2021
This gets metrics about requests and response codes
from nginx into prometheus, so we can look for 5xx and
4xx errors from it.

Note that datahub.berkeley.edu does *not* go through this,
but everything else. berkeley-dsep-infra#2167
tracks that.

Ref berkeley-dsep-infra#2693
@yuvipanda

With #2792, we're now collecting Prometheus metrics on each request that goes through the nginx ingress. https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ has some more info. #2693 means that datahub is not represented here yet.

https://grafana.datahub.berkeley.edu/d/IxT315H7z/http-requests-nginx-copy?orgId=1 is a trial dashboard I started playing with that has some of these metrics. I haven't really explored the data. Curious how you think this compares to the Google Cloud dashboard, @felder? Think you have the time to poke around this too?
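
For anyone poking at this later, a rough sketch (my assumption, not something already on the dashboard) of the kind of PromQL the ingress-nginx metrics enable, run here against the Prometheus HTTP API with requests; the Prometheus URL is illustrative:

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # illustrative endpoint

# Per-ingress rate of 5xx responses over the last 5 minutes, based on the
# nginx_ingress_controller_requests counter exposed by ingress-nginx.
QUERY = 'sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    print(result["metric"].get("ingress"), result["value"][1])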


balajialg commented Sep 24, 2021

@yuvipanda I was looking at this dashboard today. We had 100+ 5xx errors consistently over the past 6 hours (9/24). However, no 5xx issues were raised by our users, which got me wondering whether Prometheus is just spitting out error logs?

[screenshot: Grafana 5xx error dashboard]

Or is the assumption that a certain percentage of our users will always hit this kind of error, but it never gets reported to us because it affects only a small percentage of users (200/4000, ~5%)? I also realized that we may not be able to verify the error logs for previous outages (#2791) the way Jonathan did for GCP, as this data is not available as part of the Grafana error dashboard.


felder commented Sep 28, 2021

@yuvipanda yeah, I can take a look at this, but it'll be after my vacation.

@yuvipanda

There are a lot of spurious errors like this:

[W 2021-10-04 18:56:50.738 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/USERNAME/api/metrics/v1
[E 2021-10-04 18:56:50.740 JupyterHub log:189] 503 GET /hub/user/USERNAME/api/metrics/v1 (sanjyot_bakshi@10.0.0.19) 5.68ms

This is when their servers have been culled but their browser tab is still open. I think this should be a 404 instead... We should clean that up.

@yuvipanda

Once we get jupyterhub/jupyterhub#3636 deployed, our 503s will have a much higher signal-to-noise ratio. We can use the logs to quickly investigate where they might be coming from and fix them as we find them.

@yuvipanda

We can also access data from Stackdriver in our Grafana - https://grafana.com/grafana/plugins/stackdriver/. That might be helpful?


balajialg commented Oct 6, 2021

@yuvipanda Will Stackdriver push the data that @felder analyzed via the GCP console? CC'ing @felder so that he can look into this issue and add his thoughts about the value of this data when he is back.

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 6, 2021
Brings in jupyterhub/jupyterhub#3639
without bringing in all of master.

Should eventually help us clean up the spurious 503s.

Ref berkeley-dsep-infra#2693
yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 6, 2021
Brings in jupyterlab/jupyterlab#11205,
to support us bringing in jupyterhub/jupyterhub#3636
so we can have cleaner 503 error graphs.

Ref berkeley-dsep-infra#2693 (comment)
@yuvipanda

[screenshot: Grafana graph of 503 errors dropping to zero]

Now our background level of 503s has dropped to 0, and we can treat any spike as something to investigate!

Note that the main datahub is still not going through this proxy - see #2693. I'll try to fix that up soon.

yuvipanda added a commit to yuvipanda/datahub that referenced this issue Oct 18, 2021
Bumps us from 7.x to 8.x, for newer features and (hopefully)
better support for pagerduty - I couldn't get the pagerduty
alert notification to work with v2 API keys.

Ref berkeley-dsep-infra#2693
@yuvipanda

So Grafana supports defining alerts (https://grafana.com/docs/grafana/latest/alerting/) and supports many different notification channels. I've signed us up for PagerDuty and invited @felder and @balajialg. We can define alerts on each graph - I've defined a test one for pods that aren't scheduled (https://grafana.datahub.berkeley.edu/d/R3O4mbg7z/cluster-information?orgId=1&viewPanel=7), and it seems to work!

We can pipe these alerts to slack as well.


balajialg commented Oct 18, 2021

@yuvipanda, this is an awesome update!

Correct me if I've understood the workflow right! For example: we can configure Slack (or any other real-time) notifications whenever 500 errors increase beyond a threshold that is not considered usual. Is this an accurate understanding of one of the use cases?

@yuvipanda

@balajialg yeah that's correct.

@balajialg

@yuvipanda Considering the 5xx dashboard, is this issue done? Anything else to be done pertaining to the scope of the issue?


balajialg commented Dec 2, 2021

@yuvipanda Considering that we are tracking PagerDuty-related alerts through #2988, I am closing this issue. Please feel free to reopen it if you think this is separate/incomplete/requires further attention.

shaneknapp pushed commits referencing this issue to berkeley-dsep-infra/biology-user-image (Sep 6, 2024), berkeley-dsep-infra/publichealth-user-image (Sep 11, 2024), berkeley-dsep-infra/eecs-user-image (Sep 11, 2024), berkeley-dsep-infra/julia-user-image (Sep 12, 2024), berkeley-dsep-infra/ischool-user-image (Sep 24, 2024), and berkeley-dsep-infra/datahub-user-image (Sep 25, 2024), each with the same message:

Brings in jupyterlab/jupyterlab#11205,
to support us bringing in jupyterhub/jupyterhub#3636
so we can have cleaner 503 error graphs.

Ref berkeley-dsep-infra/datahub#2693 (comment)