epp servicemonitor #1425

sallyom · 2025-08-21T04:30:25Z

What type of PR is this?
/kind feature

What this PR does / why we need it:

This adds serviceMonitor to scrape metrics from EPP pod.
This PR has been tested with release-0.5 branch changes (https://github.com/sallyom/gateway-api-inference-extension/tree/add-epp-svcmonitor-0.5)
This also adds monitoring.gke & creates the monitoringSecret when enabled.

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

Helm charts can enable scraping metrics from the EPP pod. Set `inferenceExtension.monitoring.prometheus.enabled` to create a ServiceMonitor that matches the EPP service. For GKE environments, monitoring is configured automatically when `provider.name` is set to `gke`.
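
As a minimal sketch of the release note above, assuming the key names it quotes (`inferenceExtension.monitoring.prometheus.enabled`, `provider.name`); the override file name `my-values.yaml` is a hypothetical placeholder:

```yaml
# my-values.yaml (hypothetical file name)
inferenceExtension:
  monitoring:
    prometheus:
      # Renders a ServiceMonitor that targets the EPP Service.
      enabled: true

# On GKE the note says monitoring follows from the provider setting,
# so no separate monitoring flag is shown here.
# provider:
#   name: gke
```

A ServiceMonitor is a Prometheus Operator custom resource, so the operator's CRDs need to be installed in the cluster before the chart can create one.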

netlify · 2025-08-21T04:31:11Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`d063ad9`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68bf449eeea971000899dc3d
😎 Deploy Preview	https://deploy-preview-1425--gateway-api-inference-extension.netlify.app

sallyom · 2025-08-22T16:32:21Z

I've tested this with the commit cherry-picked onto the release-0.5 branch. Where should this PR go: against that branch, or here, against main? I'm running with llm-d, which brings in release-0.5, so the EPP deployment uses camelCase flags rather than hyphenated flags. Please advise, ty.

@liu-cong ptal

liu-cong

Thanks! Regarding the branch, we should target the main branch.

liu-cong · 2025-08-22T23:10:06Z

config/charts/inferencepool/templates/epp-sa-token-secret.yaml

+apiVersion: v1
+kind: Secret
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-token


can we add namespace here as well?

liu-cong · 2025-08-22T23:10:32Z

config/charts/inferencepool/templates/epp-servicemonitor.yaml

+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-monitor


same comment, add namespace (I assume this is namespace scoped?)

liu-cong · 2025-08-22T23:19:18Z

config/charts/inferencepool/values.yaml


+  # Monitoring configuration for EPP
+  monitoring:
+    # ServiceMonitor configuration for EPP metrics collection with Prometheus Operator


There can be different monitoring providers (we definitely want to add GKE here as well), so we should structure it in a more extensible way. There are many fields here that are common to most providers, e.g., the metrics path, selector, etc. So something like:

monitoring:
  path: "metrics"
  ... # other common params
  prometheusProvider:
    # prometheus specific params

And I would argue to start with the minimal configuration possible to keep it simple. The helm charts are meant for helping with initial setup instead of full configurability (advanced users can always fork and customize however they want). See some guiding principles in the llm-d installation north star https://docs.google.com/document/d/1Y0fJGhELfdXj-Xkznhrl48sDOp_dUvuy5sX4lf9g63o/edit?tab=t.0

I added monitoring.gke & monitoring.prometheus ptal
For monitoring.gke.enabled - a serviceaccount secret is also created - I believe b4 this, it was up to user to manually create? AFAICT it's the same type of secret for both - the serviceaccount-token secret - in GKE as in other K8s?

liu-cong · 2025-08-22T23:23:44Z

config/charts/inferencepool/values.yaml

+      path: "/metrics"
+      interval: "10s"
+      # scrapeTimeout: "10s"
+      labels: {}


Where do we need labels, annotations?

we don't need those, I'll remove

JeffLuoo · 2025-08-26T14:02:28Z

config/charts/inferencepool/templates/epp-servicemonitor.yaml

+  - interval: {{ .Values.inferenceExtension.monitoring.interval }}
+    port: {{ .Values.inferenceExtension.monitoring.port }}
+    path: {{ .Values.inferenceExtension.monitoring.path }}
+    authorization:


If ServiceMonitor is namespace-scoped, does the secret need to reside in the same namespace of the CR?

AFAIK - but I've never run with the secret in another ns so not 100% sure - definitely I'd say best practice though.

(namespace is included in the secret template)

AFAICT there is no option to add namespace in that authorization.credentials section.

I see that both the CRD and secret use the namespace template .Release.Namespace so it should be good. Closing this comment.
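
A minimal sketch of the namespace pinning discussed in this thread, assuming the Secret is a standard ServiceAccount token Secret. The name template and `.Release.Namespace` come from the diffs above; the ServiceAccount name in the annotation is an assumption for illustration, not taken from the chart:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "gateway-api-inference-extension.name" . }}-token
  # Same namespace as the ServiceMonitor, which reads credentials
  # from a Secret in its own namespace.
  namespace: {{ .Release.Namespace }}
  annotations:
    # Assumption: the EPP ServiceAccount shares the chart's base name.
    kubernetes.io/service-account.name: {{ include "gateway-api-inference-extension.name" . }}
type: kubernetes.io/service-account-token
```

The ServiceMonitor endpoint then refers to the Secret by name and key only (for a token Secret the key is `token`), which is why there is no namespace field to set under `authorization.credentials`.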

danehans · 2025-09-02T21:35:25Z

config/charts/inferencepool/values.yaml

+    # Prometheus ServiceMonitor configuration for EPP metrics collection with Prometheus Operator
+    prometheus:
+      enabled: false
+      # scrapeTimeout: "10s"


Why is scrapeTimeout commented out?

it's commented because it's referenced/included in the template but has a default of 10s when not set. There are other fields with defaults in the template that aren't commented out but instead shown as empty objects/arrays, however - to keep values.yaml clean & simple, we could remove these and leave it to users to know how to find the defaulted fields - these are:

{{- with .Values.inferenceExtension.monitoring.prometheus.scrapeTimeout }}
{{- with .Values.inferenceExtension.monitoring.prometheus.relabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.metricRelabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.selector.matchLabels }}

So, usually, for such defaulted values, I think you'd see in the helm values a commented out value for simple values and an empty array example to show users how to set. So like:

# scrapeTimeout: "10s"   # ← commented
relabelings: []          # ← empty array
metricRelabelings: []    # ← empty array
selector:
  matchLabels: {}        # ← empty object

I'll leave them as/is for now but I'm also not opposed to removing these from the values file.

I've removed these options to simplify - can add later if anyone finds they need them

danehans · 2025-09-02T21:37:26Z

@sallyom thanks for the PR. Please rebase and provide feedback for the review comments when you have a moment.

sallyom · 2025-09-03T02:26:45Z

rebased, ty!

liu-cong

Thanks! We also need to update the document in https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/config/charts/inferencepool

liu-cong · 2025-09-03T18:45:18Z

config/charts/inferencepool/templates/gke.yaml

@@ -1,4 +1,4 @@
-{{- if eq (lower .Values.provider.name) "gke" }}
+{{- if and (eq (lower .Values.provider.name) "gke") .Values.inferenceExtension.monitoring.gke.enabled }}


NOTE: .Values.inferenceExtension.monitoring.gke.enabled should be applied to the ClusterPodMonitoring object below only. The rest is not monitoring specific.

oops - i removed that (and there is no longer a monitoring.gke.enabled flag)

liu-cong · 2025-09-03T18:46:09Z

config/charts/inferencepool/values.yaml

+    interval: "10s"
+    scheme: "http"
+    # port -- Port name to scrape metrics from (must match service port name)
+    port: "http-metrics"


I think the default port name is metrics

it has to match the service port name, that is currently 'http-metrics' :( but I agree it should be just 'metrics' - can the service port be changed? It's here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/charts/inferencepool/templates/epp-service.yaml#L15

Sorry, I was out of office for the last few days. The port name should be the one on the pod, right? And it's metrics:

gateway-api-inference-extension/config/charts/inferencepool/templates/epp-deployment.yaml

Line 51 in 8b154ba

- name: metrics

liu-cong · 2025-09-03T18:50:07Z

config/charts/inferencepool/values.yaml

+
+    # GKE monitoring configuration (ClusterPodMonitoring)
+    gke:
+      enabled: false


NOTE: Currently we always create the ClusterPodMonitoring object if GKE is the provider. I don't know of a good use case where you wouldn't want it in GKE. So I would prefer to keep that behavior, and just remove the enabled flag for GKE.

removed it!

liu-cong · 2025-09-03T18:54:17Z

config/charts/inferencepool/values.yaml

+  # Monitoring configuration for EPP
+  monitoring:
+    # Common monitoring parameters
+    path: "/metrics"


I think we just need to keep the interval and secret here since others are unlikely to change.

removed all but those

liu-cong · 2025-09-03T18:57:42Z

config/charts/inferencepool/values.yaml

+      enabled: false
+      # scrapeTimeout: "10s"
+      # relabelings -- RelabelConfigs to apply to samples before scraping
+      relabelings: []


I think I asked in my initial review but just asking again :) do we need all these configurations, or can we just take defaults for the most part? For example, unless we have concrete use cases, I would err on the side of simplicity and just take (opinionated) good default values in the template, instead of allowing too much configurability (it adds cognitive overhead for users and we need to document the options very well).

I removed all the optional configs, it's much simpler now

sallyom · 2025-09-05T14:34:48Z

@liu-cong I believe I resolved all feedback, also I added to the chart docs as suggested, PTAL and TY!

sallyom · 2025-09-05T14:39:37Z

also, if the epp-service port-name is changed from http-metrics to metrics that can be a follow-up? not sure if there is any concern of backward compatibility with that

Deployment port is metrics
Service port is http-metrics that's what ServiceMonitor matches with.

Signed-off-by: sallyom <somalley@redhat.com>

sallyom · 2025-09-09T15:49:17Z

@liu-cong anything else needed with this? TY

liu-cong · 2025-09-09T16:04:16Z

> @liu-cong anything else needed with this? TY

@sallyom Can you address this question? Otherwise this lgtm. Thanks!

sallyom · 2025-09-10T16:30:50Z

> > @liu-cong anything else needed with this? TY
>
> @sallyom Can you address this question? Otherwise this lgtm. Thanks!

for ServiceMonitors, the service port name is used, not the pod port name. Ideally, pod port name matches service port name but not the case here.

liu-cong · 2025-09-10T16:37:39Z

> for ServiceMonitors, the service port name is used, not the pod port name. Ideally, pod port name matches service port name but not the case here.

Good to know, thanks! How is it going to scrape the metrics? Via the load balancing of a k8s service? Will it be able to differentiate the metrics from each pod?
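
To make the port-name point concrete, here is a minimal sketch of how a Service and a ServiceMonitor line up. The names `http-metrics` (Service port) and `metrics` (container port) come from this thread; the resource names, labels, and port number 9090 are hypothetical placeholders, and the named `targetPort` is an assumption about how the Service is wired:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-epp              # hypothetical name
  labels:
    app: example-epp             # hypothetical label
spec:
  selector:
    app: example-epp
  ports:
    - name: http-metrics         # Service port name (from the thread)
      port: 9090                 # assumed port number
      targetPort: metrics        # container port name on the EPP pod
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-epp-monitor      # hypothetical name
spec:
  selector:
    matchLabels:
      app: example-epp           # selects the Service above by label
  endpoints:
    - port: http-metrics         # must equal the Service port name
      path: /metrics
      interval: 10s
```

On the scraping question: with a ServiceMonitor, Prometheus uses the Service only to discover the endpoints behind it and then scrapes each pod address directly. Requests are therefore not load-balanced, and series from different pods stay distinguishable by their target labels (for example `pod` and `instance`).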

liu-cong · 2025-09-10T16:37:59Z

/lgtm

@danehans to approve if this looks good to you as well

ahg-g · 2025-09-10T17:01:24Z

config/charts/inferencepool/templates/gke.yaml

+          name: {{ .Values.inferenceExtension.monitoring.secret.name }}
          key: token
-          namespace: {{ .Values.gke.monitoringSecret.namespace }}
+          namespace: {{ .Release.Namespace }}


@JeffLuoo this means we need to change the GKE docs since we now have to create a secret under the pool's namespace and we need to create one for each epp deployment.

The update in this PR includes a new file config/charts/inferencepool/templates/epp-sa-token-secret.yaml that is the secret required for scraping the metric. The namespace uses the same template {{ .Release.Namespace }}.

JeffLuoo

LGTM, just a minor recommendation on default scrape interval.

JeffLuoo · 2025-09-10T17:26:03Z

config/charts/inferencepool/values.yaml


+  # Monitoring configuration for EPP
+  monitoring:
+    interval: "10s"


nit: Make the default scraping interval 15s instead of 10s.

See https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work which explains that a rate query should have a range of four times the scrape interval. The most common range in queries is 1m, hence setting the interval to 15s would be better.

I came across this, which is why I set to 10s - says to set to less than 15s - wdyt?
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dashboards#troubleshooting

Ah thanks for catching this. I checked the dashboard and we are all using $__rate_interval. Should be good here to keep it as 10s.
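
The arithmetic behind both positions, as a short worked sketch. It assumes Grafana's documented definition of `$__rate_interval` as the larger of (dashboard interval plus scrape interval) and four scrape intervals; the symbols $s$ (scrape interval) and $\Delta$ (dashboard interval) are introduced here for illustration:

```latex
\begin{aligned}
\text{fixed 1m range:}\quad & 4 \times 15\,\mathrm{s} = 60\,\mathrm{s}
  \quad\text{(a 15s scrape interval fills the window exactly)} \\
\texttt{\$\_\_rate\_interval}:\quad & \max(\Delta + s,\; 4s),
  \quad s = 10\,\mathrm{s} \;\Rightarrow\; \text{range} \ge 40\,\mathrm{s}
\end{aligned}
```

Because the dashboards use `$__rate_interval` rather than a hard-coded 1m, the range scales with the scrape interval and 10s works. Grafana takes $s$ from the scrape interval configured on the Prometheus data source, so that setting should match the chart's interval.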

Gregory-Pereira · 2025-09-11T14:43:47Z

@nirrozenbaum @kfswain are we including this in this release? It was discussed as being part of the v0.3 release of llm-d; I just want to check that this is still our current understanding.

ahg-g · 2025-09-12T00:46:06Z

/approve
/lgtm

k8s-ci-robot · 2025-09-12T00:46:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, JeffLuoo, sallyom

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahg-g · 2025-09-12T00:47:11Z

> @nirrozenbaum @kfswain are we including this in this release? It was discussed as being part of the v0.3 release of llm-d; I just want to check that this is still our current understanding.

The release already happened. If this is a blocker for llm-d 0.3, we can patch it and do a minor release.

* epp servicemonitor and clusterpodmonitor templates
  Signed-off-by: sallyom <somalley@redhat.com>
* add monitoring chart doc
  Signed-off-by: sallyom <somalley@redhat.com>

---------

Signed-off-by: sallyom <somalley@redhat.com>

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 21, 2025

k8s-ci-robot requested review from ahg-g and robscott August 21, 2025 04:30

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 21, 2025

sallyom mentioned this pull request Aug 21, 2025

add epp servicemonitor llm-d-incubation/llm-d-modelservice#87

Closed

sallyom mentioned this pull request Aug 21, 2025

add epp servicemonitor to examples llm-d-incubation/llm-d-infra#192

Closed

liu-cong reviewed Aug 22, 2025

liu-cong mentioned this pull request Aug 25, 2025

Add GKE monitoring config to the helm chart #1452

Closed

sallyom force-pushed the add-epp-svcmonitor-main branch from dd7522e to 8fe1262 Compare August 26, 2025 13:13

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 26, 2025

sallyom force-pushed the add-epp-svcmonitor-main branch from 8fe1262 to 0fb68bd Compare August 26, 2025 13:23

JeffLuoo reviewed Aug 26, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2025

danehans reviewed Sep 2, 2025

sallyom force-pushed the add-epp-svcmonitor-main branch from 0fb68bd to 2a6520d Compare September 3, 2025 02:25

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2025

liu-cong reviewed Sep 3, 2025

sallyom force-pushed the add-epp-svcmonitor-main branch from 2a6520d to 5b33d0a Compare September 4, 2025 03:51

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 4, 2025

sallyom force-pushed the add-epp-svcmonitor-main branch 2 times, most recently from a7f1d6e to 45eef27 Compare September 4, 2025 04:28

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025

sallyom force-pushed the add-epp-svcmonitor-main branch from 5c5e963 to b6ef26d Compare September 5, 2025 14:37

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025

sallyom added 2 commits September 8, 2025 17:02

epp servicemonitor and clusterpodmonitor templates

b736e51

Signed-off-by: sallyom <somalley@redhat.com>

add monitoring chart doc

d063ad9

Signed-off-by: sallyom <somalley@redhat.com>

sallyom force-pushed the add-epp-svcmonitor-main branch from b6ef26d to d063ad9 Compare September 8, 2025 21:03

k8s-ci-robot assigned liu-cong Sep 10, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025

ahg-g reviewed Sep 10, 2025

JeffLuoo approved these changes Sep 10, 2025

k8s-ci-robot assigned ahg-g Sep 12, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2025

k8s-ci-robot merged commit 29ea290 into kubernetes-sigs:main Sep 12, 2025
10 checks passed

zetxqx mentioned this pull request Sep 19, 2025

v1.0.1 patch release #1616

Open
