[inputs.prometheus] SIGSEGV on startup with Kubernetes 1.20 #10085

Closed
sfitts opened this issue Nov 10, 2021 · 12 comments · Fixed by #10215
Labels
bug unexpected problem or unintended behavior

Comments

@sfitts

sfitts commented Nov 10, 2021

Relevant telegraf.conf

[agent]
      collection_jitter = "0s"
      debug = false
      flush_interval = "30s"
      flush_jitter = "1s"
      hostname = "$HOSTNAME"
      interval = "30s"
      logfile = ""
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      omit_hostname = false
      precision = ""
      quiet = false
      round_interval = true
    [[processors.enum]]
       [[processors.enum.mapping]]
        dest = "status_code"
        field = "status"
        [processors.enum.mapping.value_mappings]
            critical = 3
            healthy = 1
            problem = 2


    [[outputs.influxdb]]
      database = "kubernetes"
      insecure_skip_verify = false
      password = ""
      retention_policy = ""
      timeout = "5s"
      url = "http://influxdb:8086"
      user_agent = "telegraf"
      username = ""

    [[inputs.prometheus]]
      monitor_kubernetes_pods = true
    [[inputs.internal]]
      collect_memstats = false

System info

Telegraf 1.20.3, Kubernetes 1.20.7

Docker

No response

Steps to reproduce

  1. Deploy Telegraf to a K8s 1.20 cluster using the helm chart from the official repo. We have seen this failure in both EKS and AKS.
  2. Use the configuration shown above via the ConfigMap.
  3. Observe that the pod produces an error on startup.

Expected behavior

Telegraf should start and the prometheus input should start scraping the pods it finds via discovery. This works just fine in Kubernetes 1.19, but fails as described above in K8s 1.20.

Actual behavior

Telegraf pod dies immediately with the following error:

2021-11-10T02:06:37Z I! Starting Telegraf 1.20.3
2021-11-10T02:06:37Z I! Using config file: /etc/telegraf/telegraf.conf
2021-11-10T02:06:37Z I! Loaded inputs: internal prometheus
2021-11-10T02:06:37Z I! Loaded aggregators:
2021-11-10T02:06:37Z I! Loaded processors: enum
2021-11-10T02:06:37Z I! Loaded outputs: influxdb
2021-11-10T02:06:37Z I! Tags enabled: host=telegraf-polling-service
2021-11-10T02:06:37Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:30s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x285f71c]

goroutine 36 [running]:
github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).watchPod(0xc000476fc0, {0x575c368, 0xc0002a04c0}, 0x0)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:113 +0xfc
github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).startK8s.func1()
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:92 +0x24c
created by github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).startK8s
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:79 +0x2af

Additional info

No response

sfitts added the bug (unexpected problem or unintended behavior) label on Nov 10, 2021
@powersj
Contributor

powersj commented Nov 10, 2021

Hi,

That trace is from this bit of code:

func (p *Prometheus) watchPod(ctx context.Context, client *kubernetes.Clientset) error {
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})
	defer watcher.Stop()
	if err != nil {
		return err
	}

The nil pointer happens when trying to defer watcher.Stop(). Looks like there was an error trying to get a watch interface. If I put up a branch that checks the error first and build some Telegraf artifacts, are you in a position to take a test build so we can see what error comes up?
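
For reference, the fix would presumably just check the error before the deferred call touches the watcher, roughly:

func (p *Prometheus) watchPod(ctx context.Context, client *kubernetes.Clientset) error {
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})
	// Check the error first; deferring watcher.Stop() on a nil watcher is
	// what produces the SIGSEGV in the trace above.
	if err != nil {
		return err
	}
	defer watcher.Stop()

That way a failed watch request surfaces as a readable error instead of a panic.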

@sfitts
Author

sfitts commented Nov 10, 2021

@powersj -- yep, not in a position to build the code, but should be able to modify the existing Docker image with a new executable for use in our K8s cluster.

@powersj
Contributor

powersj commented Nov 10, 2021

Alright, #10091 has artifacts now attached to it. Can you please give those a shot?

Thanks!

@sfitts
Author

sfitts commented Nov 10, 2021

New output is:

$ kctl logs telegraf-prom-54ddc5595f-mcbv9
2021-11-10T21:15:56Z I! Starting Telegraf
2021-11-10T21:15:56Z I! Using config file: /etc/telegraf/telegraf.conf
2021-11-10T21:15:56Z I! Loaded inputs: internal prometheus
2021-11-10T21:15:56Z I! Loaded aggregators:
2021-11-10T21:15:56Z I! Loaded processors: enum
2021-11-10T21:15:56Z I! Loaded outputs: influxdb
2021-11-10T21:15:56Z I! Tags enabled: host=telegraf-polling-service
2021-11-10T21:15:56Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:30s
2021-11-10T21:15:57Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:15:58Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:15:59Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:00Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:01Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:02Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:03Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:04Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:05Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:06Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:07Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:08Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:09Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:10Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:11Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)

@sfitts
Author

sfitts commented Nov 12, 2021

Tried to work around this by using the node-level scrape, but I can't get that working either. I don't see any errors in the log, but it also isn't scraping from any pods (turning on debug tracing does not emit the "will scrape metrics" message from registerPod). Hard to tell what's wrong beyond that, however.

@ahothan
Contributor

ahothan commented Dec 1, 2021

This is likely caused by a permission issue with your telegraf pod. The error message is not very helpful.
Check whether you have the proper RBAC for your telegraf pod.
For this to work, you need to create a cluster role and role binding, something like:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-sa

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # "namespace" omitted since ClusterRoles are not namespaced
  name: pod-reader
rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Pod
  # objects is "pods"
  resources: ["pods"]
  verbs: ["get", "list", "watch"] 

---

apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "telegraf-sa" to read pods in the "default" namespace.
# You need to already have a Role named "pod-reader" in that namespace.
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default  # put telegraf namespace here
subjects:
# You can specify more than one "subject"
- kind: ServiceAccount
  name: telegraf-sa 
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # "roleRef" specifies the binding to a Role / ClusterRole
  kind: ClusterRole #this must be Role or ClusterRole
  name: pod-reader # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io

@sfitts
Author

sfitts commented Dec 1, 2021

We had assumed it was probably something along those lines, but we have yet to find the incantation that allows it to work. Currently we have the following definitions for the SA, Role, and RoleBinding. Note that this configuration works fine on our K8s 1.18-based clusters.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
  resourceVersion: "3529662"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: telegraf-prom
subjects:
- kind: ServiceAccount
  name: telegraf-prom
  namespace: shared
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
rules: null

On our 1.20-based EKS cluster, the example you give above fails to apply with:

The RoleBinding "read-pods" is invalid: subjects[0].apiGroup: Unsupported value: "rbac.authorization.k8s.io": supported values: "" 

Removing that key from the subjects section allows it to be applied. Unfortunately it produces the same result. We still see the (admittedly unhelpful) error message in the telegraf log.

It does seem to be permission-related, but we've yet to find the grants that will make it happy (at least starting with K8s 1.20).
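
As a sanity check (using the names from our manifests above), something like this should tell us whether the service account is actually allowed to watch pods:

$ kubectl auth can-i watch pods -n shared --as=system:serviceaccount:shared:telegraf-prom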

@ahothan
Contributor

ahothan commented Dec 2, 2021

Sorry, the apiGroup line under the ServiceAccount subject in my example above needs to be removed.
Here is a config that was just tested and works with k8s 1.21:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: telegraf-k8s-role-{{.Release.Name}}
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding granting the cluster role to the telegraf service account
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: telegraf-k8s-role-{{.Release.Name}}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: telegraf-k8s-role-{{.Release.Name}}
subjects:
- kind: ServiceAccount
  name: telegraf-k8s-{{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-k8s-{{ .Release.Name }}

(ClusterRole/ClusterRoleBinding instead of Role/RoleBinding is needed when watching pods at the cluster level.)

@ahothan
Contributor

ahothan commented Dec 2, 2021

In your config, the rules section is empty (null); have you tried adding the same rules as those defined in my config above?
You also use Role/RoleBinding scoped to the namespace "shared", which will likely not allow watching pods in other namespaces (fine if you only want to watch that namespace).

@sfitts
Author

sfitts commented Dec 2, 2021

in your config, the rules is empty (null), have you tried adding the same rules as those defined in my config above?

Yes, sorry I was unclear about that -- I replaced our definitions with the ones you originally provided and got the same error. However, we haven't yet tried the ones in your most recent update; once we do, we'll let you know the result. Thanks!

@sfitts
Author

sfitts commented Dec 2, 2021

Confirmed that the configuration above works. Looks like K8s actually started applying some security to the watch API (which is good). Thanks for the help in finding the correct permissions to use.

ahothan added a commit to ahothan/telegraf that referenced this issue Dec 3, 2021
Address documentation gap
@ahothan
Contributor

ahothan commented Dec 3, 2021

cool! thanks for verifying.
