[inputs.prometheus] SIGSEGV on startup with Kubernetes 1.20 #10085

Closed
sfitts opened this issue Nov 10, 2021 · 12 comments · Fixed by #10215
Labels
bug unexpected problem or unintended behavior

Comments

@sfitts

sfitts commented Nov 10, 2021

Relevant telegraf.conf

[agent]
      collection_jitter = "0s"
      debug = false
      flush_interval = "30s"
      flush_jitter = "1s"
      hostname = "$HOSTNAME"
      interval = "30s"
      logfile = ""
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      omit_hostname = false
      precision = ""
      quiet = false
      round_interval = true
    [[processors.enum]]
       [[processors.enum.mapping]]
        dest = "status_code"
        field = "status"
        [processors.enum.mapping.value_mappings]
            critical = 3
            healthy = 1
            problem = 2


    [[outputs.influxdb]]
      database = "kubernetes"
      insecure_skip_verify = false
      password = ""
      retention_policy = ""
      timeout = "5s"
      url = "http://influxdb:8086"
      user_agent = "telegraf"
      username = ""

    [[inputs.prometheus]]
      monitor_kubernetes_pods = true
    [[inputs.internal]]
      collect_memstats = false

System info

Telegraf 1.20.3, Kubernetes 1.20.7

Docker

No response

Steps to reproduce

  1. Deploy Telegraf to a K8s 1.20 cluster using the helm chart from the official repo. We have seen this failure in both EKS and AKS.
  2. Use the configuration shown above via the ConfigMap.
  3. Observe that the pod produces an error on startup.

Expected behavior

Telegraf should start and the prometheus input should start scraping the pods it finds via discovery. This works just fine in Kubernetes 1.19, but fails as described above in K8s 1.20.

Actual behavior

Telegraf pod dies immediately with the following error:

2021-11-10T02:06:37Z I! Starting Telegraf 1.20.3
2021-11-10T02:06:37Z I! Using config file: /etc/telegraf/telegraf.conf
2021-11-10T02:06:37Z I! Loaded inputs: internal prometheus
2021-11-10T02:06:37Z I! Loaded aggregators:
2021-11-10T02:06:37Z I! Loaded processors: enum
2021-11-10T02:06:37Z I! Loaded outputs: influxdb
2021-11-10T02:06:37Z I! Tags enabled: host=telegraf-polling-service
2021-11-10T02:06:37Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:30s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x285f71c]

goroutine 36 [running]:
github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).watchPod(0xc000476fc0, {0x575c368, 0xc0002a04c0}, 0x0)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:113 +0xfc
github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).startK8s.func1()
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:92 +0x24c
created by github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).startK8s
        /go/src/github.com/influxdata/telegraf/plugins/inputs/prometheus/kubernetes.go:79 +0x2af

Additional info

No response

sfitts added the bug (unexpected problem or unintended behavior) label on Nov 10, 2021
@powersj
Contributor

powersj commented Nov 10, 2021

Hi,

That trace is from this bit of code:

func (p *Prometheus) watchPod(ctx context.Context, client *kubernetes.Clientset) error {
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})
	defer watcher.Stop()
	if err != nil {
		return err
	}

The nil pointer happens when trying to defer watcher.Stop(). Looks like there was an error trying to get a watch interface. If I put up a branch that checks the error first and build some Telegraf artifacts, are you in a position to take a test build so we can see what error comes up?
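
For reference, the fix would presumably just check the error before the deferred call touches the watcher, roughly:

func (p *Prometheus) watchPod(ctx context.Context, client *kubernetes.Clientset) error {
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})
	// Check the error first; deferring watcher.Stop() on a nil watcher is
	// what produces the SIGSEGV in the trace above.
	if err != nil {
		return err
	}
	defer watcher.Stop()

That way a failed watch request surfaces as a readable error instead of a panic.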

@sfitts
Author

sfitts commented Nov 10, 2021

@powersj -- yep, not in a position to build the code, but should be able to modify the existing Docker image with a new executable for use in our K8s cluster.

@powersj
Contributor

powersj commented Nov 10, 2021

Alright, #10091 has artifacts now attached to it. Can you please give those a shot?

Thanks!

@sfitts
Author

sfitts commented Nov 10, 2021

New output is:

$ kctl logs telegraf-prom-54ddc5595f-mcbv9
2021-11-10T21:15:56Z I! Starting Telegraf
2021-11-10T21:15:56Z I! Using config file: /etc/telegraf/telegraf.conf
2021-11-10T21:15:56Z I! Loaded inputs: internal prometheus
2021-11-10T21:15:56Z I! Loaded aggregators:
2021-11-10T21:15:56Z I! Loaded processors: enum
2021-11-10T21:15:56Z I! Loaded outputs: influxdb
2021-11-10T21:15:56Z I! Tags enabled: host=telegraf-polling-service
2021-11-10T21:15:56Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:30s
2021-11-10T21:15:57Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:15:58Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:15:59Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:00Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:01Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:02Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:03Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:04Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:05Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:06Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:07Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:08Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:09Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:10Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)
2021-11-10T21:16:11Z E! [inputs.prometheus] Unable to watch resources: unknown (get pods)

@sfitts
Author

sfitts commented Nov 12, 2021

Tried to work around this by using the node-level scrape, but I can't get that working either. I don't see any errors in the log, but it also isn't scraping from any pods (turning on debug tracing does not emit the "will scrape metrics" message from registerPod). Hard to tell what's wrong beyond that, however.

@ahothan
Contributor

ahothan commented Dec 1, 2021

This is likely caused by a permission issue with your telegraf pod. The error message is not very helpful.
Check whether you have the proper RBAC for your telegraf pod.
For this to work, you need to create a cluster role and role binding, something like:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-sa

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # "namespace" omitted since ClusterRoles are not namespaced
  name: pod-reader
rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Pod
  # objects is "pods"
  resources: ["pods"]
  verbs: ["get", "list", "watch"] 

---

apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "telegraf-sa" to read pods in the "default" namespace.
# You need to already have a Role named "pod-reader" in that namespace.
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default  # put telegraf namespace here
subjects:
# You can specify more than one "subject"
- kind: ServiceAccount
  name: telegraf-sa 
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # "roleRef" specifies the binding to a Role / ClusterRole
  kind: ClusterRole #this must be Role or ClusterRole
  name: pod-reader # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io

@sfitts
Author

sfitts commented Dec 1, 2021

We had assumed it was probably something along those lines, but we have yet to find the incantation that allows it to work. Currently we have the following definitions for the SA, Role, and RoleBinding. Note that this configuration works fine on our K8s 1.18-based clusters.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
  resourceVersion: "3529662"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: telegraf-prom
subjects:
- kind: ServiceAccount
  name: telegraf-prom
  namespace: shared
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    meta.helm.sh/release-name: telegraf-prom
    meta.helm.sh/release-namespace: shared
  labels:
    app.kubernetes.io/instance: telegraf-prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.8.6
  name: telegraf-prom
  namespace: shared
rules: null

On our 1.20-based EKS cluster, the example you give above fails to apply with:

The RoleBinding "read-pods" is invalid: subjects[0].apiGroup: Unsupported value: "rbac.authorization.k8s.io": supported values: "" 

Removing that key from the subjects section allows it to be applied. Unfortunately it produces the same result. We still see the (admittedly unhelpful) error message in the telegraf log.

It does seem to be permission-related, but we've yet to find the grants that will make it happy (at least starting with K8s 1.20).
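
As a sanity check (using the names from our manifests above), something like this should tell us whether the service account is actually allowed to watch pods:

$ kubectl auth can-i watch pods -n shared --as=system:serviceaccount:shared:telegraf-prom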

@ahothan
Contributor

ahothan commented Dec 2, 2021

Sorry, the apiGroup line under the ServiceAccount subject in my example above needs to be removed.
Here is a config that was just tested and works with k8s 1.21:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: telegraf-k8s-role-{{.Release.Name}}
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding granting the cluster role to the telegraf service account
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: telegraf-k8s-role-{{.Release.Name}}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: telegraf-k8s-role-{{.Release.Name}}
subjects:
- kind: ServiceAccount
  name: telegraf-k8s-{{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-k8s-{{ .Release.Name }}

(ClusterRole/ClusterRoleBinding instead of Role/RoleBinding is needed when watching pods at the cluster level.)

@ahothan
Contributor

ahothan commented Dec 2, 2021

In your config, the rules section is empty (null); have you tried adding the same rules as those defined in my config above?
You also use Role/RoleBinding scoped to the namespace "shared", which will likely not allow watching pods in other namespaces (fine if you only want to watch that namespace).

@sfitts
Author

sfitts commented Dec 2, 2021

in your config, the rules is empty (null), have you tried adding the same rules as those defined in my config above?

Yes, sorry I was unclear about that -- I replaced our definitions with the ones you originally provided and got the same error. However, we haven't yet tried the ones in your most recent update; once we do, we'll let you know the result. Thanks!

@sfitts
Author

sfitts commented Dec 2, 2021

Confirmed that the configuration above works. Looks like K8s actually started applying some security to the watch API (which is good). Thanks for the help in finding the correct permissions to use.

ahothan added a commit to ahothan/telegraf that referenced this issue Dec 3, 2021
Address documentation gap
@ahothan
Contributor

ahothan commented Dec 3, 2021

cool! thanks for verifying.
