-
Notifications
You must be signed in to change notification settings - Fork 180
epp servicemonitor #1425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
epp servicemonitor #1425
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
I've tested this with the commit cherry-picked onto the @liu-cong ptal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Regarding the branch, we should target the main branch.
apiVersion: v1 | ||
kind: Secret | ||
metadata: | ||
name: {{ include "gateway-api-inference-extension.name" . }}-token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add namespace here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
apiVersion: monitoring.coreos.com/v1 | ||
kind: ServiceMonitor | ||
metadata: | ||
name: {{ include "gateway-api-inference-extension.name" . }}-monitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment, add namespace (I assume this is namespace scoped?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
||
# Monitoring configuration for EPP | ||
monitoring: | ||
# ServiceMonitor configuration for EPP metrics collection with Prometheus Operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be different monitoring providers (we definitely want to add GKE here as well), so we should structure it in a more extensible way. There are many fields here that are common to most providers, e.g., the metrics path, selector, etc. So something like:
monitoring:
path: "metrics"
... # other common params
prometheusProvider:
# prometheus specific params
And I would argue to start with the minimal configuration possible to keep it simple. The helm charts are meant for helping with initial setup instead of full configurability (advanced users can always fork and customize however they want). See some guiding principles in the llm-d installation north star https://docs.google.com/document/d/1Y0fJGhELfdXj-Xkznhrl48sDOp_dUvuy5sX4lf9g63o/edit?tab=t.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added monitoring.gke & monitoring.prometheus ptal
For monitoring.gke.enabled - a serviceaccount secret is also created - I believe b4 this, it was up to user to manually create? AFAICT it's the same type of secret for both - the serviceaccount-token secret - in GKE as in other K8s?
path: "/metrics" | ||
interval: "10s" | ||
# scrapeTimeout: "10s" | ||
labels: {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we need labels
, annotations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need those, I'll remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
dd7522e
to
8fe1262
Compare
8fe1262
to
0fb68bd
Compare
- interval: {{ .Values.inferenceExtension.monitoring.interval }} | ||
port: {{ .Values.inferenceExtension.monitoring.port }} | ||
path: {{ .Values.inferenceExtension.monitoring.path }} | ||
authorization: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ServiceMonitor is namespace-scoped, does the secret need to reside in the same namespace of the CR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK - but I've never run with the secret in another ns so not 100% sure - definitely I'd say best practice though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(namespace is included in the secret template)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT there is no option to add namespace in that authorization.credentials section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that both the CRD and secret use the namespace template .Release.Namespace
so it should be good. Closing this comment.
# Prometheus ServiceMonitor configuration for EPP metrics collection with Prometheus Operator | ||
prometheus: | ||
enabled: false | ||
# scrapeTimeout: "10s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is scrapeTimeout
commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's commented because it's referenced/included in the template but has a default of 10s when not set. There are other fields with defaults in the template that aren't commented out but instead shown as empty objects/arrays, however - to keep values.yaml clean & simple, we could remove these and leave it to users to know how to find the defaulted fields - these are:
{{- with .Values.inferenceExtension.monitoring.prometheus.scrapeTimeout }}
{{- with .Values.inferenceExtension.monitoring.prometheus.relabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.metricRelabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.selector.matchLabels }}
So, usually, for such defaulted values, I think you'd see in the helm values a commented out value for simple values and an empty array example to show users how to set. So like:
# scrapeTimeout: "10s" # ← commented
relabelings: [] # ← empty array
metricRelabelings: [] # ← empty array
selector:
matchLabels: {} # ← empty object
I'll leave them as/is for now but I'm also not opposed to removing these from the values file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed these options to simplify - can add later if anyone finds they need them
@sallyom thanks for the PR. Please rebase and provide feedback for the review comments when you have a moment. |
0fb68bd
to
2a6520d
Compare
rebased, ty! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! We also need to update the document in https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/config/charts/inferencepool
@@ -1,4 +1,4 @@ | |||
{{- if eq (lower .Values.provider.name) "gke" }} | |||
{{- if and (eq (lower .Values.provider.name) "gke") .Values.inferenceExtension.monitoring.gke.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: .Values.inferenceExtension.monitoring.gke.enabled
should be applied to the ClusterPodMonitoring
object below only. The rest is not monitoring specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops - i removed that (and there is no longer a monitoring.gke.enabled flag)
interval: "10s" | ||
scheme: "http" | ||
# port -- Port name to scrape metrics from (must match service port name) | ||
port: "http-metrics" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default port name is metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it has to match the service port name, that is currently 'http-metrics' :( but I agree it should be just 'metrics' - can the service port be changed? It's here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/charts/inferencepool/templates/epp-service.yaml#L15
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was out of office in last few days. The port name should be the one on the pod, right? And it's metrics
:
gateway-api-inference-extension/config/charts/inferencepool/templates/epp-deployment.yaml
Line 51 in 8b154ba
- name: metrics |
|
||
# GKE monitoring configuration (ClusterPodMonitoring) | ||
gke: | ||
enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Currently we always create the ClusterPodMonitoring object if GKE is the provider. I don't know a good use where you don't want it in GKE. So I would prefer to keep that behavior, and just remove the enabled
flag for GKE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it!
# Monitoring configuration for EPP | ||
monitoring: | ||
# Common monitoring parameters | ||
path: "/metrics" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to keep the interval
and secret
here since others are unlikely to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed all but those
enabled: false | ||
# scrapeTimeout: "10s" | ||
# relabelings -- RelabelConfigs to apply to samples before scraping | ||
relabelings: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I asked in my initial review but just asking again :) do we need all these configuration or can we just take defaults for the most part? For example, unless we have concrete use cases, I would err on the side of simplicity and just take (opinionated) good default values in the template, instead of allowing too much configurability (the it adds cognitive overhead to the users and we need to document them very well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed all the optional configs, it's much simpler now
2a6520d
to
5b33d0a
Compare
a7f1d6e
to
45eef27
Compare
@liu-cong I believe I resolved all feedback, also I added to the chart docs as suggested, PTAL and TY! |
5c5e963
to
b6ef26d
Compare
also, if the epp-service port-name is changed from Deployment port is |
Signed-off-by: sallyom <somalley@redhat.com>
Signed-off-by: sallyom <somalley@redhat.com>
b6ef26d
to
d063ad9
Compare
@liu-cong anything else needed with this? TY |
Good to know, thanks! How is it going to scrape the metrics? Via the load balancing of a k8s service? Will it be able to differentiate the metrics from each pod? |
/lgtm @danehans to approve if this looks good to you as well |
name: {{ .Values.inferenceExtension.monitoring.secret.name }} | ||
key: token | ||
namespace: {{ .Values.gke.monitoringSecret.namespace }} | ||
namespace: {{ .Release.Namespace }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeffLuoo this means we need to change the GKE docs since we now have to create a secret under the pool's namespace and we need to create one for each epp deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update in this PR includes a new file config/charts/inferencepool/templates/epp-sa-token-secret.yaml
that is the secret required for scraping the metric. The namespace uses the same template {{ .Release.Namespace }}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a minor recommendation on default scrape interval.
|
||
# Monitoring configuration for EPP | ||
monitoring: | ||
interval: "10s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Make the default scraping interval 15s
instead of 10s
.
See https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work that the rate
query should have a range four times the scrape interval. The most common range in the query is 1m
, hence setting interval to 15s
will be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across this, which is why I set to 10s - says to set to less than 15s - wdyt?
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dashboards#troubleshooting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for catching this. I checked the dashboard and we are all using $__rate_interval
. Should be good here to keep it as 10s.
@nirrozenbaum @kfswain are we including this in this release? It was discussed as being part of the v0.3 release of |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, JeffLuoo, sallyom The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The released already happened. If this is a blocker for llm-d 0.3, we can patch it and do a minor release. |
* epp servicemonitor and clusterpodmonitor templates Signed-off-by: sallyom <somalley@redhat.com> * add monitoring chart doc Signed-off-by: sallyom <somalley@redhat.com> --------- Signed-off-by: sallyom <somalley@redhat.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
This adds serviceMonitor to scrape metrics from EPP pod.
This PR has been tested with
release-0.5
branch changes (https://github.com/sallyom/gateway-api-inference-extension/tree/add-epp-svcmonitor-0.5)This also adds monitoring.gke & creates the monitoringSecret when enabled.
Which issue(s) this PR fixes:
Does this PR introduce a user-facing change?: