Prometheus based metrics and monitoring for KFServing models (kubeflow#1276)

* initial readme
* readme blurb
* readme blurb
* anchors
* renaming
* renaming
* install prometheus operator
* install prometheus operator
* install prometheus operator
* kfserving prefixes
* prom services
* readme
* readme
* readme
* prom operator samples
* v1 cluster role and binding
* images
* readme
* Access prometheus metrics
* Access prometheus metrics
* minimal prometheus setup
* fixed prom queries
* fixed prom queries
* fixed typos
* restored kfserving jupyter notebook
* instructions for kustomizing install namespace
magdalenakuhn17 · Jan 8, 2021 · c884755
1 parent da3e7be
commit c884755
Showing 11 changed files with 160 additions and 0 deletions.
diff --git a/docs/samples/metrics-and-monitoring/README.md b/docs/samples/metrics-and-monitoring/README.md
@@ -0,0 +1,78 @@
+# Metrics and Monitoring
+
+> Getting started with Prometheus-based monitoring of KFServing models.
+
+# Table of Contents
+1. [Install Prometheus](#install-prometheus)
+2. [Access Prometheus Metrics](#access-prometheus-metrics)
+3. [Metrics-driven experiments and progressive delivery](#metrics-driven-experiments-and-progressive-delivery)
+4. [Removal](#removal)
+
+## Install Prometheus
+
+**Prerequisites:** Kubernetes cluster and [Kustomize v3](https://kubectl.docs.kubernetes.io/installation/kustomize/).
+
+Install Prometheus using the Prometheus Operator. The commands below assume a local clone of the `kfserving` repository.
+
+```shell
+cd kfserving
+kustomize build docs/samples/metrics-and-monitoring/prometheus-operator | kubectl apply -f -
+kubectl wait --for condition=established --timeout=120s crd/prometheuses.monitoring.coreos.com
+kubectl wait --for condition=established --timeout=120s crd/servicemonitors.monitoring.coreos.com
+kustomize build docs/samples/metrics-and-monitoring/prometheus | kubectl apply -f -
+```
+
+> Note: The above steps install Kubernetes resource objects in the `kfserving-monitoring` namespace. The namespace can be customized with Kustomize. To install into a different namespace, say `my-monitoring`, change `kfserving-monitoring` to `my-monitoring` in the following three files: a) `prometheus-operator/namespace.yaml`, b) `prometheus-operator/kustomization.yaml`, and c) `prometheus/kustomization.yaml`.
+
+## Access Prometheus Metrics
+In this section, we will use a v1beta1 InferenceService sample to demonstrate how to access Prometheus metrics that are automatically generated by [Knative's queue-proxy container](https://knative.dev) for your KFServing models.
+
+1. `kubectl create ns kfserving-test`
+2. `cd docs/samples/v1beta1/sklearn`
+3. `kubectl apply -f sklearn.yaml -n kfserving-test`
+4. If you are using a Minikube-based cluster, run `minikube tunnel` in a separate terminal and supply your password if prompted.
+5. In a separate terminal, follow [these instructions](https://github.com/kubeflow/kfserving/blob/master/README.md#determine-the-ingress-ip-and-ports) to find and set your ingress IP, host, and service hostname. Then send prediction requests to the `sklearn-iris` model you created in step 3, as follows.
+```shell
+while clear; do \
+  curl -v \
+  -H "Host: ${SERVICE_HOSTNAME}" \
+  -d @./iris-input.json \
+  http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/sklearn-iris/infer
+  sleep 0.3
+done
+```
+6. In a separate terminal, port-forward the Prometheus service.
+```shell
+kubectl port-forward service/prometheus-operated -n kfserving-monitoring 9090:9090
+```
+7. Open the Prometheus UI in your browser at http://localhost:9090.
+8. Query the number of prediction requests sent to the sklearn model over the last 60 seconds. You can use the following query in the Prometheus UI:
+
+```
+sum(increase(revision_app_request_latencies_count{service_name=~"sklearn-iris-predictor-default"}[60s]))
+```
+
+You should see a response similar to the following.
+
+![Request count](requestcount.png)
+
+9. Query the mean latency of prediction requests served by the same model over the last 60 seconds. You can use the following query in the Prometheus UI:
+
+```
+sum(increase(revision_app_request_latencies_sum{service_name=~"sklearn-iris-predictor-default"}[60s]))/sum(increase(revision_app_request_latencies_count{service_name=~"sklearn-iris-predictor-default"}[60s]))
+```
+
+You should see a response similar to the following.
+
+![Request latency](requestlatency.png)
+
+## Metrics-driven experiments and progressive delivery
+See [iter8-kfserving](https://github.com/iter8-tools/iter8-kfserving).
+
+## Removal
+Remove Prometheus and Prometheus Operator as follows.
+```shell
+cd kfserving
+kustomize build docs/samples/metrics-and-monitoring/prometheus | kubectl delete -f -
+kustomize build docs/samples/metrics-and-monitoring/prometheus-operator | kubectl delete -f -
+```
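The two queries from steps 8 and 9 of the README can also be run outside the browser. The following is a minimal sketch, not part of this commit, that builds the same PromQL strings and reads the result of an instant query from the Prometheus HTTP API (`/api/v1/query`). It assumes the port-forward from step 6 is active on `localhost:9090`; the function names are hypothetical and chosen for illustration only.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumption: Prometheus is reachable through the port-forward from step 6.
PROMETHEUS_URL = "http://localhost:9090"


def request_count_query(service_name: str, window: str = "60s") -> str:
    """PromQL for the number of requests within the window (step 8)."""
    selector = f'{{service_name=~"{service_name}"}}[{window}]'
    return f"sum(increase(revision_app_request_latencies_count{selector}))"


def mean_latency_query(service_name: str, window: str = "60s") -> str:
    """PromQL for the mean request latency within the window (step 9)."""
    selector = f'{{service_name=~"{service_name}"}}[{window}]'
    return (
        f"sum(increase(revision_app_request_latencies_sum{selector}))"
        f"/sum(increase(revision_app_request_latencies_count{selector}))"
    )


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_value(response: dict) -> Optional[float]:
    """Return the first sample of an instant-vector response, or None if empty."""
    results = response.get("data", {}).get("result", [])
    if response.get("status") != "success" or not results:
        return None
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(results[0]["value"][1])


def run_query(promql: str) -> Optional[float]:
    """Send the query to Prometheus and return the first sample value."""
    with urllib.request.urlopen(instant_query_url(promql), timeout=10) as resp:
        return first_value(json.load(resp))
```

With the request loop from step 5 running, `run_query(request_count_query("sklearn-iris-predictor-default"))` should return a value comparable to the one shown in the Prometheus UI. It returns `None` if no samples have been scraped yet.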
diff --git a/docs/samples/metrics-and-monitoring/prometheus-operator/kustomization.yaml b/docs/samples/metrics-and-monitoring/prometheus-operator/kustomization.yaml
@@ -0,0 +1,5 @@
+namePrefix: kfserving-
+namespace: kfserving-monitoring
+resources:
+- github.com/prometheus-operator/prometheus-operator?ref=v0.44.1
+- namespace.yaml
diff --git a/docs/samples/metrics-and-monitoring/prometheus-operator/namespace.yaml b/docs/samples/metrics-and-monitoring/prometheus-operator/namespace.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: kfserving-monitoring
diff --git a/docs/samples/metrics-and-monitoring/prometheus/clusterrole.yaml b/docs/samples/metrics-and-monitoring/prometheus/clusterrole.yaml
@@ -0,0 +1,24 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: prometheus
+rules:
+- apiGroups: [""]
+  resources:
+  - nodes
+  - nodes/metrics
+  - services
+  - endpoints
+  - pods
+  verbs: ["get", "list", "watch"]
+- apiGroups: [""]
+  resources:
+  - configmaps
+  verbs: ["get"]
+- apiGroups:
+  - networking.k8s.io
+  resources:
+  - ingresses
+  verbs: ["get", "list", "watch"]
+- nonResourceURLs: ["/metrics"]
+  verbs: ["get"]
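To make the effect of the resource rules above concrete, here is a simplified sketch of how they could be evaluated. It is an illustrative model only, not the Kubernetes authorizer: it ignores wildcards, `resourceNames`, and the `nonResourceURLs` rule, and the helper name `is_allowed` is hypothetical.

```python
# The three resource rules from clusterrole.yaml, written as plain dictionaries.
RULES = [
    {
        "apiGroups": [""],
        "resources": ["nodes", "nodes/metrics", "services", "endpoints", "pods"],
        "verbs": ["get", "list", "watch"],
    },
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get"]},
    {
        "apiGroups": ["networking.k8s.io"],
        "resources": ["ingresses"],
        "verbs": ["get", "list", "watch"],
    },
]


def is_allowed(rules, api_group: str, resource: str, verb: str) -> bool:
    """True if at least one rule names the API group, the resource, and the verb."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )
```

Under this model, Prometheus may list and watch pods, services, and endpoints for service discovery, but may only `get` configmaps and has no write access to any resource.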
diff --git a/docs/samples/metrics-and-monitoring/prometheus/clusterrolebinding.yaml b/docs/samples/metrics-and-monitoring/prometheus/clusterrolebinding.yaml
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: prometheus
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: prometheus
+subjects:
+- kind: ServiceAccount
+  name: prometheus
+  namespace: kfserving-monitoring
diff --git a/docs/samples/metrics-and-monitoring/prometheus/kfserving-service-monitor.yaml b/docs/samples/metrics-and-monitoring/prometheus/kfserving-service-monitor.yaml
@@ -0,0 +1,13 @@
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: monitoring
+spec:
+  namespaceSelector:
+    any: true
+  selector:
+    matchLabels:
+      networking.internal.knative.dev/serviceType: Private
+  endpoints:
+  - port: http-usermetric
+    interval: 15s
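The `selector.matchLabels` field above decides which Services this ServiceMonitor scrapes. As a rough sketch of the equality-based matching semantics: a Service is selected only if it carries every listed label with exactly the listed value, and an empty selector matches everything. The example Service labels below are hypothetical and only illustrate the rule.

```python
# The selector from kfserving-service-monitor.yaml.
MATCH_LABELS = {"networking.internal.knative.dev/serviceType": "Private"}


def matches_labels(match_labels: dict, labels: dict) -> bool:
    """True if every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Hypothetical label sets, for illustration only.
private_service = {
    "networking.internal.knative.dev/serviceType": "Private",
    "app": "example",
}
public_service = {"networking.internal.knative.dev/serviceType": "Public"}
unlabelled_service = {"app": "example"}
```

The empty-selector case is the same rule that `serviceMonitorSelector: {}` relies on in `prometheus.yaml`: with no labels required, every ServiceMonitor is selected.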
diff --git a/docs/samples/metrics-and-monitoring/prometheus/kustomization.yaml b/docs/samples/metrics-and-monitoring/prometheus/kustomization.yaml
@@ -0,0 +1,8 @@
+namePrefix: kfserving-
+namespace: kfserving-monitoring
+resources:
+- clusterrole.yaml
+- clusterrolebinding.yaml
+- prometheus.yaml
+- serviceaccount.yaml
+- kfserving-service-monitor.yaml
diff --git a/docs/samples/metrics-and-monitoring/prometheus/prometheus.yaml b/docs/samples/metrics-and-monitoring/prometheus/prometheus.yaml
@@ -0,0 +1,12 @@
+apiVersion: monitoring.coreos.com/v1
+kind: Prometheus
+metadata:
+  name: prometheus
+spec:
+  serviceAccountName: kfserving-prometheus
+  serviceMonitorNamespaceSelector: {}
+  serviceMonitorSelector: {}
+  resources:
+    requests:
+      memory: 400Mi
+  enableAdminAPI: false
diff --git a/docs/samples/metrics-and-monitoring/prometheus/serviceaccount.yaml b/docs/samples/metrics-and-monitoring/prometheus/serviceaccount.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: prometheus
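The ServiceAccount above is named `prometheus`, while `prometheus.yaml` refers to `kfserving-prometheus`. The sketch below models, in a deliberately simplified way, how the `namePrefix: kfserving-` setting in `prometheus/kustomization.yaml` reconciles the two names. It is an assumption of this note, not something stated in the commit, that Kustomize does not rewrite the `serviceAccountName` field of the Prometheus custom resource by default, which would explain why the prefixed name is written out there. The helper names are hypothetical, and real Kustomize has further rules (for example, kinds it does not prefix) that are not modeled here.

```python
import copy

NAME_PREFIX = "kfserving-"  # from prometheus/kustomization.yaml

SERVICE_ACCOUNT = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "prometheus"},
}

PROMETHEUS = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "Prometheus",
    "metadata": {"name": "prometheus"},
    "spec": {"serviceAccountName": "kfserving-prometheus"},
}


def apply_name_prefix(resource: dict, prefix: str) -> dict:
    """Return a copy of the resource with the prefix added to metadata.name."""
    prefixed = copy.deepcopy(resource)
    prefixed["metadata"]["name"] = prefix + prefixed["metadata"]["name"]
    return prefixed


def service_account_resolves(prometheus: dict, service_account: dict, prefix: str) -> bool:
    """Check that the hard-coded serviceAccountName equals the prefixed ServiceAccount name."""
    built = apply_name_prefix(service_account, prefix)
    return prometheus["spec"]["serviceAccountName"] == built["metadata"]["name"]
```

If the prefix in `kustomization.yaml` is changed, this check suggests that `serviceAccountName` in `prometheus.yaml` would need to be updated by hand as well.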
diff --git a/docs/samples/metrics-and-monitoring/requestcount.png b/docs/samples/metrics-and-monitoring/requestcount.png
diff --git a/docs/samples/metrics-and-monitoring/requestlatency.png b/docs/samples/metrics-and-monitoring/requestlatency.png