Skip to content

docs: add prometheus + grafana deployment guide #1019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 15, 2025

Conversation

EyalPazz
Copy link
Member

/fixes #747

Copy link

netlify bot commented Jun 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 5d60f5d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6890e75fa982f200086dfb01
😎 Deploy Preview https://deploy-preview-1019--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 19, 2025
@EyalPazz
Copy link
Member Author

@nirrozenbaum @danehans

Comment on lines 147 to 149
helm install prometheus prometheus-community/prometheus \
--namespace monitoring --create-namespace \
-f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/prometheus/values.yaml
Copy link
Contributor

@nirrozenbaum nirrozenbaum Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after running this step, I'm running the command from the helm output to get prometheus pod:

Get the Prometheus server URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=prometheus,app.kubernetes.io/instance=prometheus" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace monitoring port-forward $POD_NAME 9090

when running the export POD_NAME=... I get the following error:

error: error executing jsonpath "{.items[0].metadata.name}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
	template was:
		{.items[0].metadata.name}
	object given to jsonpath engine was:
		map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}

no pod matches the command. maybe something is missing from values.yaml?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is relying on the merge to main, it's currently pointing to the main branch despite it's not there yet

You can currently test using:
https://raw.githubusercontent.com/eyalpazz/gateway-api-inference-extension/add-prometheus-guide/config/manifests/prometheus/values.yaml

Copy link
Contributor

@nirrozenbaum nirrozenbaum Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the error after I copied values.yaml locally and used it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mind trying the instructions on a fresh kind cluster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, i'll try to do it later tonight

@EyalPazz EyalPazz requested a review from nirrozenbaum June 24, 2025 15:15
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2025
@danehans
Copy link
Contributor

@EyalPazz pls rebase.

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 9, 2025
@EyalPazz EyalPazz force-pushed the add-prometheus-guide branch from d9739ce to 2dc2f1a Compare July 9, 2025 15:10
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 9, 2025
@EyalPazz EyalPazz force-pushed the add-prometheus-guide branch from 2dc2f1a to c4fd7e6 Compare July 9, 2025 15:13
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2025
@EyalPazz EyalPazz force-pushed the add-prometheus-guide branch 2 times, most recently from 7508fbe to eb6ab1e Compare July 9, 2025 15:40
@EyalPazz
Copy link
Member Author

EyalPazz commented Jul 9, 2025

Rebased, sry for the delay

@EyalPazz
Copy link
Member Author

EyalPazz commented Jul 9, 2025

@nirrozenbaum did it work for you after we talked in private?

@nirrozenbaum
Copy link
Contributor

@nirrozenbaum did it work for you after we talked in private?

@EyalPazz sorry for the delayed response! we had some stressed milestone and I wasn't available.
unfortunately, this PR is not working for me (therefore cannot approve it as is).
I think it would be good if you document the env you're using, like version of the cluster, what type of cluster (k8s, kind, openshift, etc), in order to isolate the problem. it might help identifying the differences between your env and mine.

@EyalPazz
Copy link
Member Author

Hi!! all good
I'm not currently able to start a fresh kind cluster as something with the crd's seems to be off (i wrote about it in the slack channel), i think i know what the problem is though, the Service Account, RoleBinding etc, they are all being created in the default namespace, i forgot to change / mention it in the guide. I hope i'll be able to retest it soon, but if you have the time to check if that's actually the problem i'd greatly appreciate it

@nirrozenbaum

@nirrozenbaum
Copy link
Contributor

@EyalPazz are you able to work with release 0.5 branch? seems like main quickstart is broken

@EyalPazz EyalPazz force-pushed the add-prometheus-guide branch from eb6ab1e to 69bbbfd Compare July 23, 2025 20:17
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 23, 2025
@EyalPazz
Copy link
Member Author

0.5 branch was also kinda sketchy, but i fixed the problem
I forgot to mention the RBAC Resources at the top of the page should be applied at the monitoring namespace :(
I thought it would be better to add it to the config directory for this specific use case and not change it for everybody, should be good to go, sry for the delay @nirrozenbaum

@EyalPazz
Copy link
Member Author

@nirrozenbaum the tests are failing because of kubectl-validate that is being run on the manifests directory, and the values.yaml doesn't pass it as it isn't a resource, and i didn't find an option to ignore this file elegantly, should i create a new directory instead of placing it in manifests? and if so, what should be the name of it

@kfswain
Copy link
Collaborator

kfswain commented Jul 24, 2025

I think the values.yaml file should be pulled out of the manifests dir since its not a k8s manifest. We can probably put this under its own directory within config, as far as name of the dir, we can do like observability or something like that. i dont think we need to overthink this one

@EyalPazz
Copy link
Member Author

Should be good to go @nirrozenbaum

kind: ClusterRoleBinding
metadata:
name: inference-gateway-sa-metrics-reader-role-binding
namespace: monitoring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterRoleBinding is a cluster-scoped resource, so remove the namespace field (to avoid confusion).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw in the current guide it has a namespace so i just went along with it, i can open another PR to fix it, in order to not "complicate" this one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw in the current guide it has a namespace so i just went along with it, i can open another PR to fix it, in order to not "complicate" this one

@EyalPazz, that would be much appreciated.

namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these resources are already covered in https://gateway-api-inference-extension.sigs.k8s.io/guides/metrics-and-observability/#scrape-metrics-pprof-profiles. To avoid confusion and misconfiguration, this file should only contain ServiceAccount and an updated ClusterRoleBinding that includes an additional subjects entry for the new ServiceAccount:

- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: monitoring

Maybe use kubectl to patch the ClusterRoleBinding with this entry, so this file only contains the ServiceAccount.

Comment on lines 142 to 145
```bash
export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace monitoring port-forward $POD_NAME 3000
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify with a single command that port-forwards the grafana deployment:

kubectl -n monitoring port-forward deploy/grafana 3000

Since Grafana is not configured to use the default admin/admin login info, add a step to get the password from the secret. For example:

kubectl -n monitoring get secret grafana \
  -o go-template='{{ index .data "admin-password" | base64decode }}'

Add a note such as "You can now access the Grafana UI from http://127.0.0.1"


=== "Self-Hosted"

Create Necessary ServiceAccount and RBAC Resources:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/Necessary/the necessary/ and s/Resources/resources/

or if you update rbac.yaml to only include the SA:

s/Create Necessary ServiceAccount and RBAC Resources:/Create the necessary ServiceAccount resource:/

and then include the kubectl patch command to patch the ClusterRoleBinding with a subject for the monitoring ServiceAccount.

@danehans
Copy link
Contributor

danehans commented Aug 2, 2025

@EyalPazz PTAL and my review comments above. Let me know if you have any questions.

@EyalPazz EyalPazz force-pushed the add-prometheus-guide branch from b1e3dd2 to 5d60f5d Compare August 4, 2025 17:01
@EyalPazz
Copy link
Member Author

EyalPazz commented Aug 4, 2025

@danehans wdyt?

@danehans
Copy link
Contributor

/lgtm
/approve

@EyalPazz, thanks for your patience in getting this PR merged.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 15, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans, EyalPazz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit 478069d into kubernetes-sigs:main Aug 15, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metrics dashboard should be documented for options other than Google Managed Prometheus
5 participants