Skip to content

Conversation

sallyom
Copy link
Contributor

@sallyom sallyom commented Aug 21, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:

This adds serviceMonitor to scrape metrics from EPP pod.
This PR has been tested with release-0.5 branch changes (https://github.com/sallyom/gateway-api-inference-extension/tree/add-epp-svcmonitor-0.5)
This also adds monitoring.gke & creates the monitoringSecret when enabled.

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

Helm charts can enable scraping metrics from EPP pod. Set inferenceExtension.monitoring.prometheus.enabled to create a ServiceMonitor that matches with EPP service. For GKE environments, monitoring is automatically configured when `provider.name` is set to `gke`

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 21, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott August 21, 2025 04:30
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 21, 2025
Copy link

netlify bot commented Aug 21, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit d063ad9
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68bf449eeea971000899dc3d
😎 Deploy Preview https://deploy-preview-1425--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@sallyom
Copy link
Contributor Author

sallyom commented Aug 22, 2025

I've tested this with the commit cherry-picked onto the release-0.5 branch - where should this PR go? Against that branch or here, against main? I'm running with llm-d that brings in release-0.5 - so EPP deployment uses camelCase flags rather then hyphenated-flags Please advise, ty.

@liu-cong ptal

Copy link
Contributor

@liu-cong liu-cong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Regarding the branch, we should target the main branch.

apiVersion: v1
kind: Secret
metadata:
name: {{ include "gateway-api-inference-extension.name" . }}-token
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add namespace here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: {{ include "gateway-api-inference-extension.name" . }}-monitor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment, add namespace (I assume this is namespace scoped?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


# Monitoring configuration for EPP
monitoring:
# ServiceMonitor configuration for EPP metrics collection with Prometheus Operator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be different monitoring providers (we definitely want to add GKE here as well), so we should structure it in a more extensible way. There are many fields here that are common to most providers, e.g., the metrics path, selector, etc. So something like:

monitoring:
  path: "metrics"
  ... # other common params
  prometheusProvider:
     # prometheus specific params

And I would argue to start with the minimal configuration possible to keep it simple. The helm charts are meant for helping with initial setup instead of full configurability (advanced users can always fork and customize however they want). See some guiding principles in the llm-d installation north star https://docs.google.com/document/d/1Y0fJGhELfdXj-Xkznhrl48sDOp_dUvuy5sX4lf9g63o/edit?tab=t.0

Copy link
Contributor Author

@sallyom sallyom Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added monitoring.gke & monitoring.prometheus ptal
For monitoring.gke.enabled - a serviceaccount secret is also created - I believe b4 this, it was up to user to manually create? AFAICT it's the same type of secret for both - the serviceaccount-token secret - in GKE as in other K8s?

path: "/metrics"
interval: "10s"
# scrapeTimeout: "10s"
labels: {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we need labels, annotations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need those, I'll remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from dd7522e to 8fe1262 Compare August 26, 2025 13:13
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 26, 2025
@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from 8fe1262 to 0fb68bd Compare August 26, 2025 13:23
- interval: {{ .Values.inferenceExtension.monitoring.interval }}
port: {{ .Values.inferenceExtension.monitoring.port }}
path: {{ .Values.inferenceExtension.monitoring.path }}
authorization:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ServiceMonitor is namespace-scoped, does the secret need to reside in the same namespace of the CR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK - but I've never run with the secret in another ns so not 100% sure - definitely I'd say best practice though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(namespace is included in the secret template)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT there is no option to add namespace in that authorization.credentials section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that both the CRD and secret use the namespace template .Release.Namespace so it should be good. Closing this comment.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2025
# Prometheus ServiceMonitor configuration for EPP metrics collection with Prometheus Operator
prometheus:
enabled: false
# scrapeTimeout: "10s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is scrapeTimeout commented out?

Copy link
Contributor Author

@sallyom sallyom Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's commented because it's referenced/included in the template but has a default of 10s when not set. There are other fields with defaults in the template that aren't commented out but instead shown as empty objects/arrays, however - to keep values.yaml clean & simple, we could remove these and leave it to users to know how to find the defaulted fields - these are:

{{- with .Values.inferenceExtension.monitoring.prometheus.scrapeTimeout }}
{{- with .Values.inferenceExtension.monitoring.prometheus.relabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.metricRelabelings }}
{{- with .Values.inferenceExtension.monitoring.prometheus.selector.matchLabels }}

So, usually, for such defaulted values, I think you'd see in the helm values a commented out value for simple values and an empty array example to show users how to set. So like:

  # scrapeTimeout: "10s"        # ← commented 
  relabelings: []               # ← empty array 
  metricRelabelings: []         # ← empty array 
  selector:
    matchLabels: {}             # ← empty object

I'll leave them as/is for now but I'm also not opposed to removing these from the values file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed these options to simplify - can add later if anyone finds they need them

@danehans
Copy link
Contributor

danehans commented Sep 2, 2025

@sallyom thanks for the PR. Please rebase and provide feedback for the review comments when you have a moment.

@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from 0fb68bd to 2a6520d Compare September 3, 2025 02:25
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2025
@sallyom
Copy link
Contributor Author

sallyom commented Sep 3, 2025

rebased, ty!

Copy link
Contributor

@liu-cong liu-cong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,4 +1,4 @@
{{- if eq (lower .Values.provider.name) "gke" }}
{{- if and (eq (lower .Values.provider.name) "gke") .Values.inferenceExtension.monitoring.gke.enabled }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: .Values.inferenceExtension.monitoring.gke.enabled should be applied to the ClusterPodMonitoring object below only. The rest is not monitoring specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops - i removed that (and there is no longer a monitoring.gke.enabled flag)

interval: "10s"
scheme: "http"
# port -- Port name to scrape metrics from (must match service port name)
port: "http-metrics"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default port name is metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it has to match the service port name, that is currently 'http-metrics' :( but I agree it should be just 'metrics' - can the service port be changed? It's here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/charts/inferencepool/templates/epp-service.yaml#L15

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was out of office in last few days. The port name should be the one on the pod, right? And it's metrics:


# GKE monitoring configuration (ClusterPodMonitoring)
gke:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Currently we always create the ClusterPodMonitoring object if GKE is the provider. I don't know a good use where you don't want it in GKE. So I would prefer to keep that behavior, and just remove the enabled flag for GKE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it!

# Monitoring configuration for EPP
monitoring:
# Common monitoring parameters
path: "/metrics"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to keep the interval and secret here since others are unlikely to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed all but those

enabled: false
# scrapeTimeout: "10s"
# relabelings -- RelabelConfigs to apply to samples before scraping
relabelings: []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I asked in my initial review but just asking again :) do we need all these configuration or can we just take defaults for the most part? For example, unless we have concrete use cases, I would err on the side of simplicity and just take (opinionated) good default values in the template, instead of allowing too much configurability (the it adds cognitive overhead to the users and we need to document them very well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all the optional configs, it's much simpler now

@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from 2a6520d to 5b33d0a Compare September 4, 2025 03:51
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 4, 2025
@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch 2 times, most recently from a7f1d6e to 45eef27 Compare September 4, 2025 04:28
@sallyom
Copy link
Contributor Author

sallyom commented Sep 5, 2025

@liu-cong I believe I resolved all feedback, also I added to the chart docs as suggested, PTAL and TY!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from 5c5e963 to b6ef26d Compare September 5, 2025 14:37
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
@sallyom
Copy link
Contributor Author

sallyom commented Sep 5, 2025

also, if the epp-service port-name is changed from http-metrics to metrics that can be a follow-up? not sure if there is any concern of backward compatibility with that

Deployment port is metrics
Service port is http-metrics that's what ServiceMonitor matches with.

Signed-off-by: sallyom <somalley@redhat.com>
Signed-off-by: sallyom <somalley@redhat.com>
@sallyom sallyom force-pushed the add-epp-svcmonitor-main branch from b6ef26d to d063ad9 Compare September 8, 2025 21:03
@sallyom
Copy link
Contributor Author

sallyom commented Sep 9, 2025

@liu-cong anything else needed with this? TY

@liu-cong
Copy link
Contributor

liu-cong commented Sep 9, 2025

@liu-cong anything else needed with this? TY

@sallyom Can you address this question? Otherwise this lgtm. Thnaks!

image

@sallyom
Copy link
Contributor Author

sallyom commented Sep 10, 2025

@liu-cong anything else needed with this? TY

@sallyom Can you address this question? Otherwise this lgtm. Thnaks!

image

for ServiceMonitors, the service port name is used, not the pod port name. Ideally, pod port name matches service port name but not the case here.

@liu-cong
Copy link
Contributor

for ServiceMonitors, the service port name is used, not the pod port name. Ideally, pod port name matches service port name but not the case here.

Good to know, thanks! How is it going to scrape the metrics? Via the load balancing of a k8s service? Will it be able to differentiate the metrics from each pod?

@liu-cong
Copy link
Contributor

/lgtm

@danehans to approve if this looks good to you as well

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025
name: {{ .Values.inferenceExtension.monitoring.secret.name }}
key: token
namespace: {{ .Values.gke.monitoringSecret.namespace }}
namespace: {{ .Release.Namespace }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeffLuoo this means we need to change the GKE docs since we now have to create a secret under the pool's namespace and we need to create one for each epp deployment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update in this PR includes a new file config/charts/inferencepool/templates/epp-sa-token-secret.yaml that is the secret required for scraping the metric. The namespace uses the same template {{ .Release.Namespace }}.

Copy link
Contributor

@JeffLuoo JeffLuoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor recommendation on default scrape interval.


# Monitoring configuration for EPP
monitoring:
interval: "10s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Make the default scraping interval 15s instead of 10s.

See https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work that the rate query should have a range four times the scrape interval. The most common range in the query is 1m, hence setting interval to 15s will be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across this, which is why I set to 10s - says to set to less than 15s - wdyt?
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dashboards#troubleshooting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for catching this. I checked the dashboard and we are all using $__rate_interval. Should be good here to keep it as 10s.

@Gregory-Pereira
Copy link
Contributor

@nirrozenbaum @kfswain are we including this in this release? It was discussed as being part of the v0.3 release of llm-d I just want to check that is still our current understanding

@ahg-g
Copy link
Contributor

ahg-g commented Sep 12, 2025

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, JeffLuoo, sallyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2025
@ahg-g
Copy link
Contributor

ahg-g commented Sep 12, 2025

@nirrozenbaum @kfswain are we including this in this release? It was discussed as being part of the v0.3 release of llm-d I just want to check that is still our current understanding

The released already happened. If this is a blocker for llm-d 0.3, we can patch it and do a minor release.

@k8s-ci-robot k8s-ci-robot merged commit 29ea290 into kubernetes-sigs:main Sep 12, 2025
10 checks passed
@zetxqx zetxqx mentioned this pull request Sep 19, 2025
kfswain pushed a commit that referenced this pull request Sep 23, 2025
* epp servicemonitor and clusterpodmonitor templates

Signed-off-by: sallyom <somalley@redhat.com>

* add monitoring chart doc

Signed-off-by: sallyom <somalley@redhat.com>

---------

Signed-off-by: sallyom <somalley@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants