[Metrics] Kubernetes Prometheus metrics address #687

Closed · jacekwachowiak opened this issue May 9, 2019 · 52 comments
@jacekwachowiak
Contributor

Following the documentation at http://clipper.ai/tutorials/metrics/ I should be able to find the Prometheus address and see some metrics for the Clipper deployments, but the suggested method clipper_conn.get_metric_addr() does not seem to exist (anymore?). It is also not listed at http://docs.clipper.ai/en/v0.3.0/clipper_connection.html.

My Kubernetes cluster looks OK and it has a pod for metrics, so the question is: how can I easily get the address to access Prometheus? I tried to reuse some parts of the output of get_query_addr(), but without success.

@rkooo567
Collaborator

rkooo567 commented May 9, 2019

There is a get_metric_addr function inside KubernetesContainerManager, but as you said, there is no way to access it through clipper_conn. I am not quite sure why. @simon-mo, can you clarify this part? If there is no special reason, we can have a patch. One thing you can do right now is
clipper_conn.cm.get_metric_addr(), although it is a dirty way of doing this.
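
A minimal sketch of that workaround, assuming an already running Clipper cluster on Kubernetes (the constructor arguments and connect() vs. start_clipper() depend on your setup):

from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager(useInternalIP=True))
clipper_conn.connect()  # or start_clipper() on a fresh cluster

# Reach into the container manager directly until the method is exposed
# on ClipperConnection itself.
metric_addr = clipper_conn.cm.get_metric_addr()
print("Prometheus should be reachable at http://{}".format(metric_addr))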

@jacekwachowiak
Contributor Author

Thank you for the quick answer. Indeed,
clipper_conn.cm.get_metric_addr() returns a link:
'172.17.1.34:8080/api/v1/namespaces/default/services/metrics-at-default-cluster:1390/proxy'. I have a different problem now, though. Opening this address in the browser returns:

kind: "Status"
apiVersion: "v1"
metadata: {}
status: "Failure"
message: "no service port 1390 found for service \"metrics-at-default-cluster\""
reason: "ServiceUnavailable"
code: 503

At the same time, http://172.17.1.34:8080/api/v1/namespaces/default/services/metrics-at-default-cluster returns a nice description.

I am running kubectl proxy with kubectl proxy --address='172.17.1.34' --accept-hosts='^127\.0\.0\.1$,^172\.17\.1\.' --port=8080, where the .34 IP is the master node running Clipper. I try to access the links from a Jupyter notebook opened in a browser on another machine on the 172.17.1.* network (the master node has no GUI). I accessed Prometheus the same way with the Docker-only version, so I somewhat doubt that this is the problem.

Do you have any idea where the problem could be?

@rkooo567
Collaborator

It seems the address is trying to access port 1390 (the deployment port) although our service port is 9090:
https://github.com/ucbrise/clipper/blob/develop/clipper_admin/clipper_admin/kubernetes/prom_service.yaml
Can you try 172.17.1.34:8080/api/v1/namespaces/default/services/metrics-at-default-cluster:9090/proxy?
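
For a quick comparison of the two URLs, a short requests check like the sketch below can help; it only uses the addresses from this thread and assumes kubectl proxy is still listening on 172.17.1.34:8080:

import requests

base = "http://172.17.1.34:8080/api/v1/namespaces/default/services"
for port in ("1390", "9090"):
    url = "{}/metrics-at-default-cluster:{}/proxy/".format(base, port)
    try:
        resp = requests.get(url, timeout=5)
        print(port, "->", resp.status_code)
    except requests.RequestException as exc:
        print(port, "-> request failed:", exc)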

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 13, 2019

Going to 172.17.1.34:8080/api/v1/namespaces/default/services/metrics-at-default-cluster confirms that there is a TCP port at 9090, but I cannot connect - the page keeps loading without success. Still, since the behavior is different, I will try to debug further.
EDIT:
After a long while the link changed to ...9090/proxy/graph, but the connection was lost (read: connection reset by peer).
EDIT 2:
I checked with the link generated for the model, and it seems the problem occurs the moment the port is included. I am not quite sure whether that is a Clipper issue or purely a Kubernetes one, but if anyone has encountered something similar, it would be very helpful! (http://172.17.1.34:8080/api/v1/namespaces/default/services/query-frontend-at-default-cluster loads a correct JSON, while http://172.17.1.34:8080/api/v1/namespaces/default/services/query-frontend-at-default-cluster:1337/ shows a Failure/Not found.)

@rkooo567
Collaborator

Hmm, so when the port is included in the URL, it fails. Could you access Prometheus this way?

Also, @RehanSD @simon-mo @withsmilo, have you encountered similar issues?

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 13, 2019

No, it keeps loading for a few minutes and then shows Error: 'read tcp 10.233.90.0:44044->10.233.96.59:9090: read: connection reset by peer' Trying to reach: 'http://10.233.96.59:9090/graph'.

Unfortunately after stopping Clipper completely and rerunning it with:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from clipper_admin.deployers import python as python_deployer
clipper_conn = ClipperConnection(KubernetesContainerManager
                                 (useInternalIP=True, 
                                  create_namespace_if_not_exists=True,
                                  kubernetes_proxy_addr="127.0.0.1:8080"))
clipper_conn.start_clipper()

It gets stuck at:

19-05-13:13:11:12 WARNING  [decorators.py:34] [default-default-cluster] Clipper still initializing: 
 query frontend end point http://172.17.1.34:8080/api/v1/namespaces/default/services/query-frontend-at-default-cluster:1337/proxy/metrics health check failed, Retrying in 1 seconds...

After a few tries and roughly 25 seconds it continues with the warning (more likely an error) below:

19-05-13:13:11:18 WARNING  [decorators.py:34] [default-default-cluster] Clipper still initializing: 
 HTTPConnectionPool(host='172.17.1.34', port=8080): Read timed out. (read timeout=5), Retrying in 1 seconds...

I don't know exactly how I fixed this before, but I think kubectl proxy is somehow responsible, as it sometimes shows a context canceled error. If I stop it and continue with the other steps, I can still deploy and query the model.
EDIT: I ran Clipper without kubectl proxy and the start-up warning/error is gone. Still, I cannot connect to Prometheus.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 13, 2019

Because model deployment and querying work fine, and I cannot reach the Prometheus address in the browser either with or without the proxy, I suspect there is something wrong with the Prometheus pod itself. Both ways get stuck for a long time and then try to redirect to .../graph without success.

This is the pod log (level=info messages removed); the error starts immediately after launch:

[cloud-user@node1 ~]$ kubectl logs metrics-at-default-cluster-599c7f5b4f-8qpbr
level=error ts=2019-05-13T05:41:51.96305779Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:default:default\" cannot list resource \"pods\" in API group \"\" at the cluster scope"...

@rkooo567
Collaborator

Are you using any cloud service with RBAC? The log seems related to RBAC configuration. RBAC has been a recurring problem for KubernetesContainerManager; see https://github.com/ucbrise/clipper/issues?utf8=%E2%9C%93&q=RBAC.

We don't natively support RBAC configuration right now; #564 describes a temporary workaround.

Also, there is one PR that tried to tackle this problem: https://github.com/ucbrise/clipper/pull/605/files. Feel free to submit a PR if you resolve it!

@jacekwachowiak
Contributor Author

I checked and yes, Kubespray enables RBAC by default when it launches Kubernetes. I will try to create my cluster from scratch without it and will update the issue here.

@withsmilo
Collaborator

@jacekwachowiak
Hi, the official Kubernetes documentation gives the service's proxy URL for the case where you use kubectl proxy:

http://kubernetes_master_address/api/v1/namespaces/namespace_name/services/service_name[:port_name]/proxy
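
As a concrete illustration of that pattern with the values from this thread (kubectl proxy on 172.17.1.34:8080, service metrics-at-default-cluster, service port 9090), the URL can be built like this:

master = "172.17.1.34:8080"                   # kubectl proxy address
namespace = "default"
service = "metrics-at-default-cluster:9090"   # service_name[:port_name]
url = "http://{}/api/v1/namespaces/{}/services/{}/proxy".format(master, namespace, service)
print(url)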

@jacekwachowiak
Contributor Author

Yes, I've seen how the URLs are created, but the problem is not there.
I recreated a Kubernetes cluster without RBAC, but it seems it's not working well now, so I may have to try the solution from PR #605. Can we count on this PR, or something similar, being included in Clipper by default?

@rkooo567
Collaborator

The PR is not verified yet. Based on the fact that he created a PR, I believe it should work, though.

He ran into a similar problem before and asked about it in #564. @simon-mo said, "All we need is to add RBAC support in our kubernetes config files." You can probably also create proper RBAC config files yourself and apply them to your cluster manually using kubectl.
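
For illustration only, here is a rough sketch of what a "proper RBAC config" could look like, written with the kubernetes Python client instead of kubectl YAML; the resource names are made up, and it grants the default service account (which the Prometheus pod runs as, judging by the error above) read access to the objects Prometheus service discovery needs:

from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# ClusterRole allowing Prometheus-style service discovery (get/list/watch).
rbac.create_cluster_role(client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="prometheus-discovery"),
    rules=[client.V1PolicyRule(
        api_groups=[""],
        resources=["pods", "services", "endpoints", "nodes"],
        verbs=["get", "list", "watch"])]))

# Bind the role to the default service account in the default namespace.
# Note: newer releases of the client rename V1Subject to RbacV1Subject.
rbac.create_cluster_role_binding(client.V1ClusterRoleBinding(
    metadata=client.V1ObjectMeta(name="prometheus-discovery"),
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io",
        kind="ClusterRole",
        name="prometheus-discovery"),
    subjects=[client.V1Subject(
        kind="ServiceAccount",
        name="default",
        namespace="default")]))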

@rkooo567
Collaborator

Also, did you get the same pod logs when you created your cluster without RBAC?

@jacekwachowiak
Contributor Author

Apparently using Kubespray without RBAC is problematic (judging by how it creates the cluster): I got some pods stuck in the Pending state, and the same happened to the Clipper deployment - it crashed after 5 minutes of no progress, as programmed. Since I cannot limit myself to a single cluster-creation tool (for now Kubespray on OpenStack), I have to assume that on other clusters RBAC might not be optional (e.g. AWS EKS), so right now I think it is more important to check whether Clipper can work both with and without it. I will let you know about everything I find.

@jacekwachowiak
Contributor Author

I tried the PR, to no avail :(

level=error ts=2019-05-14T05:27:58.957257508Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:default:default\" cannot list resource \"pods\" in API group \"\" at the cluster scope"

The error appears in the Prometheus pod from the start, the same as before.

@rkooo567
Collaborator

rkooo567 commented May 14, 2019

@simon-mo created a new PR, #694, on top of #564. If it passes the kube_metric.py test, it should be the solution.

If the PR doesn't work, there are two other things you can try. Also, please make sure the ClusterRole, ClusterRoleBinding, and ServiceAccount are in place.

1. Add

serviceAccount: prometheus

to the prometheus deployment https://github.com/ucbrise/clipper/blob/develop/clipper_admin/clipper_admin/kubernetes/prom_deployment.yaml under spec. Here is an example of defining a serviceAccount in a deployment: https://stackoverflow.com/questions/44505461/how-to-configure-a-non-default-serviceaccount-on-a-deployment

2. Another solution is to bind the default serviceAccount to the ClusterRole. In this case, change the subject name from prometheus to default in the ClusterRoleBinding.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

Thank you for the fast and extensive reply, I'll take a look at the PR asap!

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

I saw that PR #693 about the metric address method was merged. I updated the repo and tried it - it no longer returns an error, but it does not return the address either:

print(clipper_conn.cm.get_metric_addr())

172.17.1.35:31894

print(clipper_conn.get_metric_addr())

None

@rkooo567
Collaborator

Lol, I forgot to add the return. I will create a PR real soon. Please use cm.get_metric_addr until then.

Also, does that mean you can access the metrics through the proxy now?

@jacekwachowiak
Contributor Author

Yes, no problem about that :)
And no, I couldn't. For now I only took the PR, without the other things you mentioned, and the Prometheus pod returns a single error:

level=error ts=2019-05-15T02:30:54.097265516Z caller=manager.go:214 component="discovery manager scrape" msg="Cannot create Kubernetes discovery" err="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

so maybe the step about the service account can fix that. Going to the address in the browser still ends in loading forever.

@rkooo567
Collaborator

Can you run
kube-apiserver -h | grep enable-admission-plugins and see whether ServiceAccount is included?

Also, @simon-mo, you might want to look at this for debugging.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

Can you give me some more details on how to run this? I get the output below and cannot find a good example of how to use kube-apiserver:

[cloud-user@node1 ~]$ kube-apiserver -h | grep enable-admission-plugins
-bash: kube-apiserver: command not found

Does this help?

[cloud-user@node1 ~]$ kubectl api-resources
NAME                              SHORTNAMES   APIGROUP                       NAMESPACED   KIND
bindings                                                                      true         Binding
componentstatuses                 cs                                          false        ComponentStatus
configmaps                        cm                                          true         ConfigMap
endpoints                         ep                                          true         Endpoints
events                            ev                                          true         Event
limitranges                       limits                                      true         LimitRange
namespaces                        ns                                          false        Namespace
nodes                             no                                          false        Node
persistentvolumeclaims            pvc                                         true         PersistentVolumeClaim
persistentvolumes                 pv                                          false        PersistentVolume
pods                              po                                          true         Pod
podtemplates                                                                  true         PodTemplate
replicationcontrollers            rc                                          true         ReplicationController
resourcequotas                    quota                                       true         ResourceQuota
secrets                                                                       true         Secret
serviceaccounts                   sa                                          true         ServiceAccount
...
[cloud-user@node1 ~]$ kubectl get serviceaccounts
NAME                         SECRETS   AGE
default                      1         26h
default-cluster-prometheus   1         176m

@rkooo567
Collaborator

To use a ServiceAccount (in our case, the prometheus service account set in prom_deployment), we should make sure the ServiceAccount admission plugin is enabled. The command above is what I found in https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/.

I found how to access kube-apiserver on Stack Overflow: https://stackoverflow.com/questions/50352621/where-is-kube-apiserver-located

Here is the important part.
"""
kube-apiserver is running as a Docker container on your master node.
Therefore, the binary is within the container, not on your host system. It is started by the master's kubelet from a file located at /etc/kubernetes/manifests.
kubelet is watching this directory and will start any Pod defined here as "static pods".

To configure kube-apiserver command line arguments you need to modify /etc/kubernetes/manifests/kube-apiserver.yaml on your master.
"""
So I guess you can get onto the master node and cat /etc/kubernetes/manifests/kube-apiserver.yaml. After that, check whether ServiceAccount is included in the admission plugins.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

[cloud-user@node1 ~]$ sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep admission-plugins
    - --enable-admission-plugins=NodeRestriction

So it seems it is not enabled.

@rkooo567
Collaborator

Okay. It seems NodeRestriction is the only admission plugin enabled. The documentation says we should add ServiceAccount to enable service accounts for RBAC.

Can you try adding ServiceAccount in that YAML file? I think you can change the flag to
--enable-admission-plugins=NodeRestriction,ServiceAccount.

If it doesn't work, you can find the kube-apiserver binary on the master node and run
kube-apiserver --enable-admission-plugins=NodeRestriction,ServiceAccount

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

On it, I'll update soon

@rkooo567
Collaborator

BTW, this is the source on why we need the ServiceAccount plugin:
kubernetes/kubernetes#27973

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

Do I have to restart the pod/container to be sure that the change was applied?
Update: it restarted automatically, so sorry for the fuss!

@rkooo567
Collaborator

rkooo567 commented May 15, 2019

It says kubelet periodically scans all the manifest files for updates, so I believe it should be fine? I am not 100% sure about this. https://stackoverflow.com/questions/50007654/how-does-kube-apiserver-restart-after-editing-etc-kubernetes-manifests-kube-api

@rkooo567
Collaborator

@jacekwachowiak How's the result? Are you able to access Prometheus now?

@jacekwachowiak
Contributor Author

Sorry for the delay, but Murphy's law got involved - the cloud I am using currently has network problems and I have to wait until they are fixed! It stopped working the moment I changed the manifest, but that is surely not the cause 😅

@rkooo567
Collaborator

Haha, gotcha! Let me know if it resolves the issue!

@jacekwachowiak
Contributor Author

I'm back, but without good news: nothing changed. After adding ServiceAccount to the --enable-admission-plugins=NodeRestriction line in /etc/kubernetes/manifests/kube-apiserver.yaml, I am still getting the same error in the logs:
caller=manager.go:214 component="discovery manager scrape" msg="Cannot create Kubernetes discovery" err="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
The Prometheus address still does not load in the browser, with or without the proxy.

@rkooo567
Collaborator

rkooo567 commented May 15, 2019

Sorry to hear that. Let's try three different things; if none of them solves it, we will work on it shortly and you can probably just wait until it is resolved.

1. In prom_deployment, under serviceAccountName, change
automountServiceAccountToken: false to automountServiceAccountToken: true, and reapply the prom_deployment.

It should look like:

#prom_deployment file, pod spec
spec:
    serviceAccountName: prometheus
    automountServiceAccountToken: true

2. (I think this is a bad solution - don't try it.)

If the above doesn't work, remove serviceAccount and automountServiceAccountToken from prom_deployment. After that, go to rbac_cluster_role_binding.yaml and change the subject:

subjects:
  - kind: ServiceAccount
    name: default   # changed from {{cluster_name}}-prometheus

3. Restart Clipper with a different service type; refer to PR #667. Set the metric serviceType to LoadBalancer. You can see the external IP address using kubectl describe {your_prom_service}. Here is an example: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/?hl=hu

(You can also just change the prom_service manifest file directly to use the LoadBalancer service type.)

I am pretty sure this will work. In that case, I guess you don't need to use the proxy to access the service, because the load balancer will expose an external IP.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 15, 2019

For step 1: I changed both occurrences of automountServiceAccountToken: false to true in /home/cloud-user/.local/lib/python3.6/site-packages/clipper_admin/kubernetes/prom_deployment.yaml. This makes the Prometheus pod log error disappear, but I noticed no other changes.
I will try everything else whenever I can.

@rkooo567
Collaborator

Hmm, so there are no more error logs, but you still cannot see the metrics from the browser?

@withsmilo How did you resolve the RBAC problem for metrics?

@withsmilo
Collaborator

Sorry, we are using the default Prometheus in our in-house Kubernetes cluster, which is managed by another team, not Clipper's.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 16, 2019

Regarding step 3, I restarted Clipper with the additional argument:

clipper_conn = ClipperConnection(KubernetesContainerManager
                                 (useInternalIP=True, 
                                  create_namespace_if_not_exists=True,
                                  service_types={'redis': 'NodePort', 'management': 'NodePort', 
                                                 'query': 'NodePort', 'query-rpc': 'NodePort', 
                                                 'metric': 'LoadBalancer'}))
clipper_conn.start_clipper()

~/.local/lib/python3.6/site-packages/clipper_admin/kubernetes/kubernetes_container_manager.py in connect(self)
    518                     port=self.clipper_metric_port))
    519             elif self.service_types['metric'] == LOAD_BALANCER:
--> 520                 self.clipper_metric_ip = v1service.status.load_balancer.ingress[0].ip
    521                 self.logger.info("Setting Clipper metric port to {ip}:{port}"
    522                                  .format(ip=self.clipper_metric_ip,

TypeError: 'NoneType' object is not subscriptable

I reverted step 1, but that only changed one thing - the log error returned (I tried step 3 both with and without step 1 reverted). I will also try setting LoadBalancer manually.

@rkooo567
Collaborator

I guess it is a problem in the code itself. Yeah, can you try setting the load balancer manually? If it doesn't work, I will try to resolve it shortly.

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 16, 2019

And also:

[cloud-user@node1 ~]$ kubectl get services
NAME                                  TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
kubernetes                            ClusterIP      10.233.0.1      <none>        443/TCP          47h
metrics-at-default-cluster            LoadBalancer   10.233.50.119   <pending>     9090:30543/TCP   2m53s

I put "type": "LoadBalancer" in two places in this file:

[cloud-user@node1 ~]$ cat /home/cloud-user/.local/lib/python3.6/site-packages/clipper_admin/kubernetes/prom_deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    ai.clipper.container.label: {{ cluster_name }}
    ai.clipper.name: metrics
  name: metrics-at-{{ cluster_name }}
spec:
  [HERE 1st]
  serviceAccountName: {{ service_account_name }}
  automountServiceAccountToken: false
  replicas: 1
  template:
    metadata:
      labels:
        ai.clipper.container.label: {{ cluster_name }}
        ai.clipper.name: metrics
    spec:
      [AND HERE 2nd]
      serviceAccountName: {{ service_account_name }}

It seems neither place works. I still get a NodePort service unless I specify the type as a KubernetesContainerManager argument.

@rkooo567
Collaborator

I think you should change prom_service, not the deployment (not sure if you already did).

@rkooo567
Collaborator

Also, it seems there is a LoadBalancer service now. Is the external IP still pending?

@jacekwachowiak
Contributor Author

jacekwachowiak commented May 16, 2019

Oh yes, I was wondering whether there is another file for the service - I'm trying it right now!
After 10 minutes the IP has still not been assigned for the case where LoadBalancer is set through KubernetesContainerManager.
UPDATE:
Setting it manually in the service manifest allows Clipper to start without the error. The IP is still not assigned, though.
After reading a bit, it seems that the LoadBalancer service type might not be supported by the cloud/resource provider (1, 2), which would explain the pending IP state.

@rkooo567
Collaborator

Makes sense. If so, using RBAC with a NodePort will be the only option. We will debug the RBAC PR real soon and let you know once it is successfully tested and merged! Sorry for the inconvenience!
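
For reference, a NodePort service is reachable on any node's IP at the allocated node port, so a minimal sketch of checking Prometheus that way (using the 9090:30543/TCP mapping from the kubectl get services output above; substitute your own node IP and port) could be:

import requests

node_ip = "172.17.1.34"   # any cluster node should work for a NodePort service
node_port = 30543         # the node port mapped to Prometheus's 9090
resp = requests.get("http://{}:{}/graph".format(node_ip, node_port), timeout=5)
print(resp.status_code)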

@jacekwachowiak
Contributor Author

Thanks! I'll see what I can do anyway and follow all the changes in the repo :)

@rkooo567
Collaborator

Hi @jacekwachowiak! @simon-mo will handle this issue shortly, and the fix is going to be included in the next release, coming soon!

@rkooo567
Collaborator

Hi @jacekwachowiak, I have a new update on #694, and it seems it can read metrics within the test. Could you merge that PR locally and see whether it works on your side?

@jacekwachowiak
Contributor Author

OK, thank you. I'll take a look and let you know when I have something.

rkooo567 self-assigned this May 30, 2019
@jacekwachowiak
Contributor Author

jacekwachowiak commented Jun 3, 2019

With the current repository status I can access Prometheus. There are still a few things that are not working as they should.

  1. I am using the method clipper_conn.cm.get_metric_addr(). The port is OK, but the IP returned is always the same (.33) - my Kubernetes node3 (out of 3). Could it be that the last node IP is always returned? The metrics pod is scheduled on a random node, so I have to run kubectl describe pod metrics-at-default-cluster-... to check which node the pod is on and replace whatever IP get_metric_addr() returned. On its own this is not a big problem.
  2. Another issue appears when the metrics pod and the other Clipper pods are not started on the same node (I don't know why, but mgmt-frontend and query-frontend always seem to end up on .33 - I will run it more times to see if something changes).
  3. The most important one: when I open Prometheus in the browser, I cannot see the Clipper-related metrics, only a few generic ones such as process_* and scrape_*, whenever the metrics pod and the model pod are on different nodes. If I restart and am lucky enough to get the metrics pod on the same node as the model, everything works fine, which for my 3-node cluster happens about 50% of the time.
     I found out that scaling up the replicas makes the metrics pod pick up the Clipper metrics, which probably means it needs a model pod on the same node. A quick way to check which targets Prometheus is actually scraping is sketched after this list.
     Update: images attached (correct Prometheus vs. incorrect Prometheus).
     Update 2: relation to the model deployment found; description changed.
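
A minimal sketch of that check, assuming the Prometheus address returned by get_metric_addr() is reachable from this machine (the address below is just the NodePort example from an earlier comment):

import requests

prom_addr = "172.17.1.35:31894"  # placeholder; use your own get_metric_addr() output
targets = requests.get("http://{}/api/v1/targets".format(prom_addr), timeout=5).json()
for target in targets["data"]["activeTargets"]:
    # Each entry shows the scraped pod's URL and whether the last scrape succeeded.
    print(target["labels"].get("job"), target["scrapeUrl"], target["health"])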

@rkooo567
Collaborator

rkooo567 commented Jun 3, 2019

@jacekwachowiak Would you mind creating a new issue, since it is a different problem from the title? (That way other people can find it later.) Also, I will look into it as soon as possible.

@jacekwachowiak
Contributor Author

Ok, on it

@rkooo567
Collaborator

rkooo567 commented Jun 5, 2019

Resolved with #694

rkooo567 closed this as completed Jun 5, 2019