node_textfile_scrape_error not reported when failing to read metrics due to inconsistent HELP texts #2317

Closed
@pieter-lautus

Description

Host operating system:

Linux tau-gfa-uat 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version:

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7)
build user: root@243aafa5525c
build date: 20211205-11:09:49
go version: go1.17.3
platform: linux/amd64

node_exporter command line flags

/usr/local/bin/node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.cpufreq \
  --collector.diskstats \
  --collector.edac \
  --collector.filefd \
  --collector.filesystem \
  --collector.filesystem.fs-types-exclude \
  '^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|fuse.*|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$' \
  --collector.hwmon \
  --collector.loadavg \
  --collector.mdadm \
  --collector.meminfo \
  --collector.netdev \
  --collector.netstat \
  --collector.pressure \
  --collector.processes \
  --collector.schedstat \
  --collector.sockstat \
  --collector.stat \
  --collector.textfile \
  --collector.textfile.directory \
  /var/lib/node_exporter_textfile \
  --collector.time \
  --collector.timex \
  --collector.vmstat \
  --web.disable-exporter-metrics \
  --web.listen-address \
  :9100

Are you running node_exporter in Docker?

No

What did you do that produced an error?

For technical reasons, different parts of our system create separate files for the same metric, each with different labels. Because the code had diverged, the files ended up with inconsistent HELP texts.

For example:

# cat tau_infrastructure_performing_maintenance_task_foo.prom 
# HELP tau_infrastructure_performing_maintenance_task The server is performing some long-running maintenance task
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="deployment", sub_task="deployment_ansible", start_or_stop="start"} 1645624007.0

and

# cat tau_infrastructure_performing_maintenance_task_bar.prom 
# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",sub_task="main",start_or_stop="start"} 1647280801.98446
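
A conflict like the one between the two files above can be detected before node_exporter reads them. The following is a minimal sketch of such a check, not part of node_exporter: the function name `find_help_conflicts` is hypothetical, and it assumes every textfile ends in `.prom` and carries its HELP text on a single `# HELP <metric> <text>` line.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches "# HELP <metric_name> <help text>" lines in the text exposition format.
HELP_RE = re.compile(r"^#\s*HELP\s+(\S+)\s*(.*)$")


def find_help_conflicts(directory):
    """Hypothetical helper: map each metric whose HELP text differs across
    *.prom files to {help_text: [file names that use it]}."""
    seen = defaultdict(lambda: defaultdict(list))
    for path in sorted(Path(directory).glob("*.prom")):
        for line in path.read_text().splitlines():
            match = HELP_RE.match(line.strip())
            if match:
                metric, help_text = match.groups()
                seen[metric][help_text.strip()].append(path.name)
    # Keep only the metrics that have more than one distinct HELP text.
    return {metric: dict(helps) for metric, helps in seen.items() if len(helps) > 1}
```

Calling `find_help_conflicts("/var/lib/node_exporter_textfile")` on the directory from the command line flags above should return a dictionary keyed by `tau_infrastructure_performing_maintenance_task` if both files are present; an empty dictionary means no HELP mismatch was found.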

What did you expect to see?

I would expect either:

  1. To get all metrics, with node_exporter arbitrarily choosing one of the two HELP texts where they differ, or
  2. If some metrics are dropped, for node_textfile_scrape_error to be non-zero for this job.

What did you see instead?

node_exporter logs an error about the situation, and not all of the metrics are scraped by Prometheus. Nevertheless, node_textfile_scrape_error is zero, even though an issue is preventing metrics from being exported.

Mar 15 10:43:25 REDACTED node_exporter[733184]: ts=2022-03-15T08:43:25.153Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728080198446e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"main\" > gauge:<value:1.64728240041946e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"start\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728080229161e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\"\n* [from Gatherer #2] collected metric tau_infrastructure_performing_maintenance_task label:<name:\"main_task\" value:\"nightly\" > label:<name:\"start_or_stop\" value:\"stop\" > label:<name:\"sub_task\" value:\"reporting\" > gauge:<value:1.64728239993993e+09 >  has help \"At what timestamp a given task started or stopped, the last time it was run.\" but should have \"The server is performing some long-running maintenance task\""
