
Use managed prometheus #3984

Open
ryanlovett opened this issue Dec 3, 2022 · 7 comments

@ryanlovett (Collaborator)

Summary

User Stories

We currently deploy prometheus as part of our hub. This is fairly straightforward; however, it becomes unavailable whenever something happens to the core nodes. For example, today one of the nodes hit a very high load, and even if we had had an alert for such a condition, it couldn't have fired because prometheus itself was taken down by that same high load.
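
For illustration, an alert for that kind of condition might look roughly like the sketch below (assuming the standard node_exporter metrics; the threshold is only an example):

# Sketch: fire when the 5-minute load average exceeds 2x the core count on a node.
# Assumes node_exporter metrics; the "per-core load > 2" threshold is illustrative only.
node_load5
  / on (instance)
    count by (instance) (node_cpu_seconds_total{mode="idle"})
> 2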

Using the managed service would avoid this.

Acceptance criteria

  • Use Google's managed service rather than the in-cluster instance.

Important information

Tasks to complete

  • bring up the managed service
  • remove the in-cluster service
ryanlovett added the enhancement (Issues around improving existing functionality) label Dec 3, 2022
@balajialg (Contributor) commented Dec 5, 2022

@ryanlovett Naive question: looking at the historical data for the past seven days, it seems the load on Friday was not as high as the load during the previous two days. Do you have any hypotheses on why Friday's load in particular had an impact? Am I missing something in this data?

[screenshots: load graphs for the past seven days]

@ryanlovett (Collaborator, Author)

@balajialg I would guess it has something to do with the academic cycle since it is the Friday before RRR week. You could drill down and see if one particular hub had much less usage or if there was a drop across all of them.
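
Something like the rough per-hub count below (a sketch reusing the kube-state-metrics kube_pod_labels series, with the jupyterhub label values we use elsewhere) would show whether the drop was concentrated in one hub or spread across all of them:

# Sketch: approximate count of user pods per hub (namespace) over time.
# Note this counts all pods carrying these labels, not only Running ones.
count by (namespace) (
  kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
)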

@balajialg (Contributor) commented Dec 6, 2022

@ryanlovett It was a drop across almost all the major hubs. Ref: https://docs.google.com/document/d/1hw3wR_1Dc40pm7OsZYubzrkk6SD4q3vzA6i89TqStKE/edit?usp=sharing. I'm getting more curious about why we had an outage despite traffic being lower than on the previous days across multiple hubs.

@yuvipanda (Contributor)

So, I created a new nodepool for support, gave it 48G of RAM, and tried to get a list of running users over the last 90 days.

It failed with a timeout, and actual memory usage never went past 8G.

So I went back and looked at the PromQL query itself:

# Sum up all running user pods by namespace
sum(
  # Grab a list of all running pods.
  # The group aggregator always returns "1" for the number of times each
  # unique label appears in the time series. This is desirable for this
  # use case because we're merely identifying running pods by name,
  # not how many times they might be running.
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server", namespace=~".*"}
  ) by (namespace, pod)
) by (namespace)

I noticed that the inner query has a namespace=~".*" matcher that is basically a no-op, since it matches everything, but still slows everything down massively because it's a regex. I removed it, and still no luck. So I removed the entire group join as well (it's there to keep out pods that might hang around in a 'completed' or 'pending' state, which is temporary and not a common occurrence).
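
For reference, the intermediate version with only the regex matcher dropped (the one that still timed out for me) looked like this:

# Intermediate attempt: same query as above, minus the namespace=~".*" matcher.
# This was still too slow.
sum(
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  ) by (namespace, pod)
) by (namespace)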

With the group join removed as well, the query looked like this:

# Sum up all running user pods by namespace
sum(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
) by (namespace)

and it loads quickly with no problem. It produces 6-month charts too.

[screenshot: 6-month chart of user pods per hub]

Max memory usage of the server pod is 15G, which is well within its previous memory limit of 24G.

So, I don't think this is a resource problem; I think our PromQL queries need to be optimized for this to work.

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this issue Dec 8, 2022
It's a regex that matches everything, and so just slows the query down.

See berkeley-dsep-infra/datahub#3984 (comment)
@yuvipanda (Contributor)

I opened jupyterhub/grafana-dashboards#50 to remove that one extra regex, but other optimization work still needs to happen.
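
One possible direction (a sketch I haven't tested) would be to keep the Running filter but avoid both the regex and the group_right join:

# Sketch only: count running user pods per hub without the group_right join.
# Set-operator matching on pod restricts to pods currently in the Running phase.
sum by (namespace) (
  group by (namespace, pod) (
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  )
  and on (pod)
  (kube_pod_status_phase{phase="Running"} == 1)
)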

@balajialg (Contributor)

  • Explore the black-box monitoring feature in Google Cloud's Operations Suite?

@yuvipanda (Contributor)

https://cloud.google.com/monitoring/uptime-checks is the blackbox uptime check feature that can be used to check whether prometheus is up, and it can send alerts to people through different channels.

I would also suggest checking to see if the prometheus on the new node actually needs that much RAM. I don't think it does, but I'll leave it as is for now.
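
A rough way to check that over time would be something like the sketch below (the pod name pattern is a guess at how the support chart names the prometheus server pod, so adjust as needed):

# Sketch: peak memory working set of the prometheus server pod over the last week.
# The pod name regex is an assumption about the support chart's naming.
max_over_time(
  container_memory_working_set_bytes{pod=~"support-prometheus-server-.*"}[7d]
)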
