hostmetrics receiver duplicates filesystem metrics on GKE #34512

Closed
tcolgate opened this issue Aug 8, 2024 · 2 comments · Fixed by #34635
Labels: bug (Something isn't working), receiver/hostmetrics

Comments


tcolgate commented Aug 8, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

When running on GKE, system.filesystem.inodes.usage and system.filesystem.usage report duplicate metrics for
mountpoint=/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet, along with other pod-specific mountpoints under:

  • /home/kubernetes/containerized_mounter/rootfs/
  • /var/lib/kubelet/pods/
  • /var/lib/kubelet/plugins/

Not all pods have the duplicated data; it appears to be more prevalent on pods that use CSI plugins.

Steps to Reproduce

Expected Result

Metrics should be collected without duplicates.

Actual Result

One of the detected mountpoints appears twice in the metrics. This then causes issues when the metrics are passed to external metrics providers such as Google Cloud Monitoring.

...
Descriptor:
     -> Name: system.filesystem.usage                                                                            
     -> Description: Filesystem bytes used.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> device: Str(/dev/dm-0)
     -> mode: Str(ro)
     -> mountpoint: Str(/)
     -> type: Str(ext2)
     -> state: Str(used)
...
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #42
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #43
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #44
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #45
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
...
Descriptor:
     -> Name: system.filesystem.usage
     -> Description: Filesystem bytes used.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: false                                                                                       
     -> AggregationTemporality: Cumulative
...
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #42
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #43
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #44
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #45
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #46
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #47
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)

When coupled with the googlemanagedprometheus exporter, we get the following error:

{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-08-07T16:21:42.771Z	error	exporterhelper/queue_sender.go:90	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[77] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[31] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[30] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[79] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[78] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.\nerror details: name = Unknown  desc = total_point_count:200  success_point_count:195  errors:{status:{code:3}  point_count:5}", "dropped_items": 287}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.104.0/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/consumers.go:43

Collector version

otelcol-contrib version 0.105.0

Environment information

Environment

OS: Google Container-Optimized OS
Compiler: official docker container image ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114

OpenTelemetry Collector configuration

# slightly trimmed down
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node-test
  namespace: kube-system
spec:
  args:
    feature-gates: exporter.googlemanagedpromethues.intToDouble,-component.UseLocalHostAsDefaultHost
  config:
    exporters:
      debug: {}
    processors:
      batch:
        send_batch_max_size: 11000
        send_batch_size: 10000
        timeout: 5s
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.start_time
          - k8s.container.name
        filter:
          node_from_env_var: NODE_NAME
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
          - from: resource_attribute
            name: k8s.namespace.name
          - from: resource_attribute
            name: k8s.pod.name
          - from: resource_attribute
            name: k8s.container.name
        - sources:
          - from: connection
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      resource:
        attributes:
        - action: insert
          key: environment
          value: staging
        - action: insert
          key: k8s.node.name
          value: ${env:NODE_NAME}
        - action: insert
          key: k8s.namespace.name
          value: ${env:NAMESPACE}
      resource/hostmetrics:
        attributes:
        - action: insert
          key: job
          value: otel-node-collector
        - action: insert
          key: namespace
          value: ${env:NAMESPACE}
      resourcedetection/gcp:
        detectors:
        - gcp
        override: false
        timeout: 2s
      transform/hostmetrics:
        error_mode: ignore
        metric_statements:
        - context: resource
          statements:
          - set(attributes["node"], attributes["k8s.node.name"])
          - set(attributes["pod"], attributes["k8s.pod.name"])
          - set(attributes["container"], attributes["k8s.container.name"])
      transform/metrics:
        metric_statements:
        - context: datapoint
          statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
    service:
      pipelines:
        metrics/hostmetrics:
          exporters:
          - debug
          processors:
          - resource/hostmetrics
          - resourcedetection/gcp
          - resource
          - filter/noiseymetrics
          - transform/hostmetrics
          receivers:
          - hostmetrics
          - kubeletstats
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace
  image: europe-west6-docker.pkg.dev/cerbos-registry/spitfire/imported/ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: daemonset
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  podDnsConfig: {}
  priorityClassName: system-node-critical
  replicas: 1
  resources: {}
  securityContext:
    runAsGroup: 0
    runAsUser: 0
  serviceAccount: kube-system-otel
  tolerations:
  - effect: NoSchedule
    operator: Exists
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /var/lib/otelcol
    name: varlibotelcol
  - mountPath: /etc/prometheus/certs
    name: tls-assets
    readOnly: true
  - mountPath: /hostfs
    mountPropagation: HostToContainer
    name: hostfs
    readOnly: true
  volumes:
  - hostPath:
      path: /var/lib/otelcol
      type: DirectoryOrCreate
    name: varlibotelcol
  - name: tls-assets
    projected:
      defaultMode: 420
      sources:
      - secret:
          name: prometheus-otel-prom-config-tls-assets-0
  - hostPath:
      path: /
    name: hostfs
status:
  image: europe-west6-docker.pkg.dev/cerbos-registry/spitfire/imported/ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114
  scale:
    selector: app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=kube-system.otel-node,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=otel-node-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea8811
  version: 0.105.0

Log output

See "what happened"

Additional context

No response

tcolgate added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Aug 8, 2024

github-actions bot commented Aug 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

tcolgate commented:

By way of further debugging: checking /proc/1/mountinfo (used by the imported shirou/gopsutil library) for one of the duplicated .../globalmount mountpoints, we see

/ # grep 095e/globalmount /proc/1/mountinfo
10048 9964 8:64 / /hostfs/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10238 10091 8:64 / /hostfs/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10415 10288 8:64 / /hostfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10582 10282 8:64 / /hostfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw

The mounts are the same filesystem mounted to the same location but (I think) under different mount namespaces, presumably exposing the same data to two different pods.

I think it would be valid to export the metrics only once per unique path (they should all have identical filesystem-level metrics). Though, equally, it's not obvious that metrics for these mounts are useful at all.

I'm working around the issue locally by dropping metrics for these paths (there's a good chance I'd drop them anyway, as they aren't terribly useful), but fixing the duplication in the hostmetrics receiver seems fair. A minimal sketch of that workaround is below.
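
For reference, a minimal sketch of that workaround using the filter processor. The processor name matches the filter/noiseymetrics referenced in the pipeline above (its definition is trimmed from the pasted config), and the OTTL conditions are illustrative rather than my exact rules:

processors:
  filter/noiseymetrics:
    error_mode: ignore
    metrics:
      datapoint:
        # Drop filesystem datapoints for the mount-namespace duplicates under the
        # kubelet/CSI paths described above; each duplicate carries identical values.
        - 'IsMatch(attributes["mountpoint"], "^/home/kubernetes/containerized_mounter/rootfs/.*")'
        - 'IsMatch(attributes["mountpoint"], "^/var/lib/kubelet/(pods|plugins)/.*")'

If I recall the scraper options correctly, the filesystem scraper's exclude_mount_points setting could achieve something similar at scrape time instead of in a processor.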

tcolgate added a commit to tcolgate/opentelemetry-collector-contrib that referenced this issue Aug 13, 2024
Mountpoints can be reported multiple times for each mount into a
namespace. This causes duplicate metrics which causes issues with
some exporters. Each instance of the mountpoint will have identical
metrics, so it is safe to ignore repeated mountpoints.

Closes open-telemetry#34512
tcolgate added a commit to tcolgate/opentelemetry-collector-contrib that referenced this issue Aug 15, 2024
atoulme removed the needs triage (New item requiring triage) label on Oct 2, 2024
jmichalek132 pushed a commit to jmichalek132/opentelemetry-collector-contrib that referenced this issue Oct 10, 2024
…elemetry#34635)

Mountpoints can be reported multiple times for each mount into a
namespace. This causes duplicate metrics which causes issues with some
exporters. Each instance of the mountpoint will have identical metrics,
so it is safe to ignore repeated mountpoints.

Closes open-telemetry#34512
