Automatically watch for errors in our infrastructure logs #2693
It's already set up actually! https://console.cloud.google.com/errors
I don't see the 503s from hub pods there though.
@yuvipanda This is an awesome improvement! I have the following questions/suggestions:
Yeah, I think a lot of the errors here are false positives, and it's not catching most actual errors. A deep dive is needed to understand this product, see whether it's actually helpful to us, and figure out how we can use it. I had never seen that page until I opened this issue, so I don't know much about it yet. @felder do you think this is something you might be able to tackle?
@yuvipanda yeah I can take a crack at it
@yuvipanda glancing at the documentation for the Google error reporting service, it appears that to get application-level errors we would need code changes to send the data to the error reporting service or to call the API. https://cloud.google.com/error-reporting/docs/setup/kubernetes-engine
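A minimal sketch of the structured-logging route those docs describe: on GKE, Error Reporting can pick up application errors from stdout when each error is logged as a single JSON line containing a stack trace and a `serviceContext`. The field names follow the docs' ReportedErrorEvent shape; the `"hub"` service name here is just a placeholder, not something from our config.

```python
# Hedged sketch: emit an exception as one structured-JSON log line that
# GKE's Error Reporting integration can ingest from stdout.
import json
import traceback


def report_error(exc: BaseException, service: str = "hub") -> str:
    """Format an exception as a single JSON log line and print it."""
    payload = {
        "severity": "ERROR",
        # Error Reporting groups entries by the stack trace in "message".
        "message": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        # "hub" is a placeholder service name for grouping in the console.
        "serviceContext": {"service": service},
    }
    line = json.dumps(payload)
    print(line)  # one JSON object per line on stdout
    return line


try:
    1 / 0
except ZeroDivisionError as e:
    logged = report_error(e)
```

The same shape works from any language; the important parts are that the entry is a single line of JSON and that the stack trace lands in `message`.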
This may prove more fruitful:
Here's a query that looks for 503s on 9/16, when we last had an issue with #2677: https://cloudlogging.app.goo.gl/yHks4Rcc8FgtzAYu5 Note the peak between 4:30pm and 5:00pm PDT. However, there is a lot of noise in here; for example, each pod gets a 503 for metrics, and requests resulting in a 200 are also logged as errors. If we could get the logging down such that actual errors are reported as type error and the noise wasn't, that would help a lot for isolating errors to alert on.
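To illustrate the kind of filtering that would cut that noise, here's a hedged Python sketch over exported log entries. The `httpRequest`/`status`/`requestUrl` field names follow the Cloud Logging HTTP-request schema, and the noise rule (metrics-scrape 503s) is just the example from this comment; real noise patterns would need tuning.

```python
# Hypothetical sketch: reduce exported Cloud Logging entries (as dicts)
# to "real" 503s, excluding known-noisy requests.

def real_503s(entries):
    """Return entries that are 503s and not known noise."""
    noisy_paths = ("/metrics",)  # per-pod metrics scrapes 503 routinely
    result = []
    for entry in entries:
        req = entry.get("httpRequest", {})
        if req.get("status") != 503:
            continue  # 200s etc. are not errors, even if logged as such
        url = req.get("requestUrl", "")
        if any(path in url for path in noisy_paths):
            continue  # expected noise, not worth alerting on
        result.append(entry)
    return result


entries = [
    {"httpRequest": {"status": 503, "requestUrl": "https://hub.example/user/alice/api"}},
    {"httpRequest": {"status": 503, "requestUrl": "https://hub.example/metrics"}},
    {"httpRequest": {"status": 200, "requestUrl": "https://hub.example/hub/home"}},
]
print(len(real_503s(entries)))  # 1
```

The same predicate could be expressed directly as a Cloud Logging filter, which is what a log-based metric would use.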
Looks like I can define log-based metrics from queries and then set up alerts based on those metrics.
@balajialg requested I do a similar log query for prob140-prod around Sept. 13. |
This is brilliant work, @felder! Thanks for spending the time to figure things out.
@felder for alerting, can you explore https://grafana.com/docs/grafana/latest/alerting/ instead? Since all our metrics are already present in grafana...
oooh, github just showed me your other messages. We definitely don't have metrics on 503s from user servers - we tried putting it in prometheus (#1993 and #1973) but prometheus was crushed under the weight of those (#1977). So let's explore that and see how it goes. Thanks for putting time into this, @felder |
This gets metrics about requests and response codes from nginx into prometheus, so we can look for 5xx and 4xx errors from it. Note that datahub.berkeley.edu does *not* go through this, but everything else. berkeley-dsep-infra#2167 tracks that. Ref berkeley-dsep-infra#2693
With #2792, we're now collecting prometheus metrics on each request that goes through nginx ingress. https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ has some more info. #2693 means that datahub is not represented here yet. https://grafana.datahub.berkeley.edu/d/IxT315H7z/http-requests-nginx-copy?orgId=1 is a trial dashboard i started playing with that has some of these metrics. I haven't really explored the data. Curious how you think this compares to the google cloud dashboard, @felder? Think you have the time to poke around this too?
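For reference, the per-request counter from that ingress-nginx monitoring guide is `nginx_ingress_controller_requests`, which carries a `status` label. A PromQL query along these lines would graph the 5xx rate (the 5m window and `by (ingress)` grouping are illustrative choices, not anything from our dashboards):

```promql
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
```

A query like this is also the natural thing to hang a Grafana alert threshold on.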
@yuvipanda I was looking at this dashboard today. We had more than 100 5xx errors consistently over the past 6 hours (9/24). However, no 5xx issues were raised by our users, which got me wondering whether Prometheus is only surfacing error logs, or whether the assumption is that a certain percentage of our users will always hit this kind of error but it never gets reported to us because it affects only a small fraction of users (200/4000 ~5%). I also realized that we may not be able to verify the error logs for previous outages (#2791) the way Jonathan did for GCP, as this data is not available as part of the grafana error dashboard.
@yuvipanda yeah can take a look at this, but it'll be after my vacation
There are a lot of spurious errors like this:
This is when their servers have been culled but their browser tab is still open. I think this should be a 404 instead... We should clean that up.
Once we get jupyterhub/jupyterhub#3636 deployed, our 503s will have a much higher signal-to-noise ratio. We can use the logs to quickly investigate where they might be and fix them as we find them.
We can also access data from stackdriver in our grafana - https://grafana.com/grafana/plugins/stackdriver/. That might be helpful?
@yuvipanda Will Stackdriver push the data that @felder analyzed via the GCP console? CC'ing @felder so that he can look into this issue and add his thoughts about the value of this data when he is back.
Brings in jupyterhub/jupyterhub#3639 without bringing in all of master. Should eventually help us clean up the spurious 503s. Ref berkeley-dsep-infra#2693
Brings in jupyterlab/jupyterlab#11205, to support us bringing in jupyterhub/jupyterhub#3636 so we can have cleaner 503 error graphs. Ref berkeley-dsep-infra#2693 (comment)
Now our background levels of 503 have dropped to 0, and we can treat any spike as something to investigate! Note that the main datahub is still not going through this proxy yet - see #2693. I'll try to fix that up soon.
Bumps us from 7.x to 8.x, for newer features and (hopefully) better support for pagerduty - I couldn't get the pagerduty alert notification to work with v2 API keys. Ref berkeley-dsep-infra#2693
So grafana supports defining alerts (https://grafana.com/docs/grafana/latest/alerting/) and supports many different notification channels. I've signed us up for pagerduty and invited @felder and @balajialg. We can define alerts on each graph - I've defined a test one for pods that aren't scheduled (https://grafana.datahub.berkeley.edu/d/R3O4mbg7z/cluster-information?orgId=1&viewPanel=7), and it seems to work! We can pipe these alerts to slack as well.
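A hedged sketch of what wiring those channels up via Grafana's notifier provisioning could look like (this uses the legacy-alerting `notifiers` provisioning format; the uids, the integration key, and the webhook URL are all placeholders, not our actual config):

```yaml
# Hypothetical Grafana notification-channel provisioning sketch.
notifiers:
  - name: PagerDuty
    type: pagerduty
    uid: pagerduty-default
    settings:
      # Placeholder: the real PagerDuty integration key goes here.
      integrationKey: YOUR_PAGERDUTY_INTEGRATION_KEY
  - name: Slack
    type: slack
    uid: slack-default
    settings:
      # Placeholder: the real Slack incoming-webhook URL goes here.
      url: YOUR_SLACK_WEBHOOK_URL
```

Channels can also be created by hand in the Grafana UI, which is presumably how the test alert above was set up.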
@yuvipanda, this is an awesome update! Correct me if I understand the workflow right. For example: we can configure slack (or any other real-time) notifications whenever 500 errors increase beyond a threshold that isn't considered usual. Is this an accurate understanding of one of the use cases?
@balajialg yeah that's correct.
@yuvipanda Considering the 5xx dashboard, is this issue done? Anything else to be done pertaining to the scope of the issue?
@yuvipanda Considering that we are tracking pagerduty-related alerts through #2988, I am closing this issue. Please feel free to reopen this if you think this is separate/incomplete/requires further attention.
https://cloud.google.com/error-reporting can help us watch for exceptions in various parts of our infrastructure (like hub, notebook users, etc). We can stop relying 100% on user reports for noticing errors this way.