Use managed prometheus #3984
Comments
@ryanlovett Naive question: looking at the historical data for the past seven days, it seems the load on Friday was not as high as the load during the previous two days. Do you have any hypotheses on why Friday's load in particular had an impact? Am I missing something in this data?
@balajialg I would guess it has something to do with the academic cycle, since it is the Friday before RRR week. You could drill down and see if one particular hub had much less usage or if there was a drop across all of them.
@ryanlovett It was a drop across almost all the major hubs. Ref: https://docs.google.com/document/d/1hw3wR_1Dc40pm7OsZYubzrkk6SD4q3vzA6i89TqStKE/edit?usp=sharing. I am getting more curious about why we had an outage despite traffic being lower than on the previous days across multiple hubs.
So, I created a new nodepool for support, gave it 48G of RAM, and tried to get the list of running users over the last 90 days. It failed with a timeout, and actual memory usage never went past 8G. So I went back and looked at the PromQL query itself.
I noticed that the inner query contains a regex matcher that matches everything. After removing it, the query loads up quickly with no problem, and it produces 6-month charts too. Max memory usage of the server pod is 15G, which is well within its previous memory limit of 24G. So I don't think this is a resource problem; I think our PromQL queries need to be optimized for this to work.
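As a rough illustration only (this is not the actual dashboard query, and the metric name here is just a stand-in), the kind of change being described looks like this: an all-matching regex label matcher filters nothing out but still forces a regex check against every candidate series, which gets expensive over a long range window.

```promql
# Hypothetical sketch, not the real dashboard query.
# The pod=~".*" matcher removes nothing, but still makes the TSDB run a
# regex against every candidate series -- slow over a 90d window.
count(
  min_over_time(kube_pod_status_phase{phase="Running", pod=~".*"}[90d])
)

# Dropping the no-op regex matcher returns the same result while
# skipping the per-series regex evaluation entirely.
count(
  min_over_time(kube_pod_status_phase{phase="Running"}[90d])
)
```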
It's a regex that matches everything, and so just slows the query down. See berkeley-dsep-infra/datahub#3984 (comment)
I opened jupyterhub/grafana-dashboards#50 to remove that one extra regex, but other optimization work still needs to happen.
https://cloud.google.com/monitoring/uptime-checks describes the blackbox uptime checks that can be used to verify that Prometheus is up, and they can send alerts to people through several different channels. I would also suggest checking whether the Prometheus on the new node actually needs that much RAM. I don't think it does, but I'll leave it as is for now.
Summary
User Stories
We currently deploy Prometheus as part of our hub. This is fairly straightforward; however, it becomes unavailable whenever something happens to the core nodes. For example, today one of the nodes hit a very high load, and if we had an alert for such a condition, it would not have fired because Prometheus itself was taken down by that same high load.
Using the managed service would avoid this.
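For illustration only, here is a minimal sketch of the kind of alert condition this is about, assuming node-exporter metrics (`node_load5`, `node_cpu_seconds_total`) are being scraped; the exact rule we would use may differ.

```promql
# Hypothetical alert expression, assuming node-exporter metrics are scraped.
# Fires when a node's 5-minute load average exceeds twice its CPU count,
# i.e. the "very high load" situation described above.
  node_load5
> on (instance)
  2 * count by (instance) (node_cpu_seconds_total{mode="idle"})
```

Evaluating a rule like this from the in-cluster Prometheus is exactly what fails when the node hosting Prometheus is the one under load; a managed, externally hosted Prometheus would keep evaluating it during the incident.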
Acceptance criteria
Important information
Tasks to complete