
Use managed prometheus #3984

Open
ryanlovett opened this issue Dec 3, 2022 · 7 comments

@ryanlovett (Collaborator)

Summary

User Stories

We currently deploy prometheus as part of our hub. This is fairly straightforward; however, it becomes unavailable whenever something happens to the core nodes. For example, today one of the nodes hit a very high load, and even if we had had an alert for such a condition, it couldn't have fired because prometheus itself was taken down by that same high load.
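
For illustration, an alert for that kind of condition might look roughly like the sketch below (assuming the standard node_exporter metrics; the threshold is only an example):

# Sketch: fire when the 5-minute load average exceeds 2x the core count on a node.
# Assumes node_exporter metrics; the "per-core load > 2" threshold is illustrative only.
node_load5
  / on (instance)
    count by (instance) (node_cpu_seconds_total{mode="idle"})
> 2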

Using the managed service would avoid this.

Acceptance criteria

  • Use Google's managed service rather than the in-cluster instance.

Important information

Tasks to complete

  • bring up the managed service
  • remove the in-cluster service
ryanlovett added the enhancement (Issues around improving existing functionality) label Dec 3, 2022
@balajialg (Contributor) commented Dec 5, 2022

@ryanlovett Naive question: looking at the historical data for the past seven days, it seems the load on Friday was not as high as the load during the previous two days. Do you have any hypotheses on why Friday's load in particular had an impact? Am I missing something in this data?

[screenshots: load graphs for the past seven days]

@ryanlovett (Collaborator, Author)

@balajialg I would guess it has something to do with the academic cycle since it is the Friday before RRR week. You could drill down and see if one particular hub had much less usage or if there was a drop across all of them.
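
Something like the rough per-hub count below (a sketch reusing the kube-state-metrics kube_pod_labels series, with the jupyterhub label values we use elsewhere) would show whether the drop was concentrated in one hub or spread across all of them:

# Sketch: approximate count of user pods per hub (namespace) over time.
# Note this counts all pods carrying these labels, not only Running ones.
count by (namespace) (
  kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
)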

@balajialg (Contributor) commented Dec 6, 2022

@ryanlovett It was a drop across almost all the major hubs. Ref: https://docs.google.com/document/d/1hw3wR_1Dc40pm7OsZYubzrkk6SD4q3vzA6i89TqStKE/edit?usp=sharing. I'm getting more curious about why we had an outage despite traffic being lower than on the previous days across multiple hubs.

@yuvipanda (Contributor)

So, I created a new nodepool for support, gave it 48G of RAM, and tried to get a list of running users over the last 90 days.

It failed with a timeout, and actual memory usage never went past 8G.

So I went back and looked at the PromQL query itself:

# Sum up all running user pods by namespace
sum(
  # Grab a list of all running pods.
  # The group aggregator always returns "1" for the number of times each
  # unique label appears in the time series. This is desirable for this
  # use case because we're merely identifying running pods by name,
  # not how many times they might be running.
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server", namespace=~".*"}
  ) by (namespace, pod)
) by (namespace)

I noticed that the inner query has a namespace=~".*" matcher that is basically a no-op, since it matches everything, but still slows everything down massively because it's a regex. I removed it, and still no luck. So I removed the entire group join as well (it's there to keep out pods that might hang around in a 'completed' or 'pending' state, which is temporary and not a common occurrence).
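
For reference, the intermediate version with only the regex matcher dropped (the one that still timed out for me) looked like this:

# Intermediate attempt: same query as above, minus the namespace=~".*" matcher.
# This was still too slow.
sum(
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  ) by (namespace, pod)
) by (namespace)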

With the group join removed as well, the query looked like this:

# Sum up all running user pods by namespace
sum(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
) by (namespace)

and it loads quickly with no problem. It produces 6-month charts too.

[screenshot: 6-month chart of user pods per hub]

Max memory usage of the server pod is 15G, which is well within its previous memory limit of 24G.

So, I don't think this is a resource problem; I think our PromQL queries need to be optimized for this to work.

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this issue Dec 8, 2022
It's a regex that matches everything, and so just slows the query down.

See berkeley-dsep-infra/datahub#3984 (comment)
@yuvipanda (Contributor)

I opened jupyterhub/grafana-dashboards#50 to remove that one extra regex, but other optimization work still needs to happen.
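
One possible direction (a sketch I haven't tested) would be to keep the Running filter but avoid both the regex and the group_right join:

# Sketch only: count running user pods per hub without the group_right join.
# Set-operator matching on pod restricts to pods currently in the Running phase.
sum by (namespace) (
  group by (namespace, pod) (
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  )
  and on (pod)
  (kube_pod_status_phase{phase="Running"} == 1)
)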

@balajialg (Contributor)

  • Explore the black-box monitoring feature in Google Cloud's Operations Suite?

@yuvipanda (Contributor)

https://cloud.google.com/monitoring/uptime-checks is the blackbox uptime check feature that can be used to check whether prometheus is up, and it can send alerts to people through different channels.

I would also suggest checking to see if the prometheus on the new node actually needs that much RAM. I don't think it does, but I'll leave it as is for now.
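
A rough way to check that over time would be something like the sketch below (the pod name pattern is a guess at how the support chart names the prometheus server pod, so adjust as needed):

# Sketch: peak memory working set of the prometheus server pod over the last week.
# The pod name regex is an assumption about the support chart's naming.
max_over_time(
  container_memory_working_set_bytes{pod=~"support-prometheus-server-.*"}[7d]
)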
