Cluster check issue with older clusters #35

kajla · 2022-09-02T15:35:48Z

Hello,

We are using this script to monitor Rancher clusters.
We upgraded it to the latest version (check_rancher2 v 1.9.0 (c) 2018-2022).
We have noticed that the cluster check now also prints errors for clusters that were registered earlier:

jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
CHECK_RANCHER2 OK - Cluster vh-rke is healthy

On newly registered RKE2 clusters it runs fine.
One small thing we noticed: the cluster ID of a cluster that checks cleanly is prefixed with 'c-m-', while the ID of a cluster that throws the error has the prefix 'c-' without the 'm'.
Without a cluster ID, each affected cluster throws 2 errors.
We have 4 older clusters and 2 newer clusters, so there are 8 (4x2) error messages:

$ /usr/lib/nagios/plugins/check_rancher2.sh -H 'rancher.mgmt' -P 'xxxxxxxxxx' -S -U 'token-yyyyy' -t 'cluster'
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
CHECK_RANCHER2 OK - All clusters (6) are healthy|'clusters_total'=6;;;; 'clusters_errors'=0;;;;

Cluster IDs:

$ /usr/lib/nagios/plugins/check_rancher2.sh -H 'rancher.mgmt' -P 'xxxxxxxxxxxxxxxxxxxxxxx' -S -U 'token-yyyyyy' -t 'info'
CHECK_RANCHER2 OK - Found 6 clusters: c-m-ppqqlmn9 alias bkp-avsz-rke - c-m-vhlmbdms alias prod-avsz-rke - c-nvqh7 alias tst-avsz-rke - c-qvgjw alias vh-rke - c-s7vbs alias tst-vh-rke - local alias rke-rancher -
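The prefix observation can be checked quickly against the IDs listed above. This is only an illustrative sketch: the function name and labels are made up here, and the prefix is the reporter's observation, not a confirmed cause of the error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a Rancher cluster ID by its prefix.
classify_cluster_id() {
  case "$1" in
    c-m-*) echo "c-m- prefix" ;;   # must be tested before the shorter c- pattern
    c-*)   echo "c- prefix" ;;
    *)     echo "other" ;;
  esac
}

# IDs taken from the 'info' output above.
for id in c-m-ppqqlmn9 c-m-vhlmbdms c-nvqh7 c-qvgjw c-s7vbs local; do
  printf '%s: %s\n' "$id" "$(classify_cluster_id "$id")"
done
```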

I ran the script in debug mode and found the parts that throw the error:

+ clusterstate=active
+ component=($(echo "$api_out_single_cluster" | jq -r '.componentStatuses[].name'))
++ jq -r '.componentStatuses[].name'

+ declare -a component
+ healthstatus=($(echo "$api_out_single_cluster" | jq -r '.componentStatuses[].conditions[].status'))
++ jq -r '.componentStatuses[].conditions[].status'
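The same jq error can be reproduced outside the plugin. The sketch below assumes that the API response for an affected cluster simply has no componentStatuses key; the JSON is a made-up, heavily trimmed stand-in, not real Rancher output, and list_components is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical cluster object without a componentStatuses key.
api_out_single_cluster='{"id":"c-qvgjw","name":"vh-rke","state":"active"}'

# Same filter as in the trace: iterating over a missing (null) key makes jq fail.
list_components() {
  echo "$1" | jq -r '.componentStatuses[].name'
}

list_components "$api_out_single_cluster" || echo "jq exited with a non-zero status"
```

Because the plugin runs the filter inside a command substitution, the failure only shows up as a message on stderr while the array stays empty, which would explain why the check still reports OK.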

Could you fix it, please?

Thank you in advance.

Regards,
Adam

Napsty · 2022-09-07T09:43:33Z

I would have to see the content of $api_out_single_cluster to see what actually throws the error. Can you please mail me the full output of the plugin, run with bash -xv? -> https://www.claudiokuenzler.com/about/

kajla · 2022-09-07T11:26:22Z

Yes, I can. I have sent the requested output by e-mail.
A small note: it seems to have something to do with the Kubernetes version. I upgraded a v1.21 cluster in two steps, first to v1.22 and then to v1.23.
The check still works on v1.21, but no longer on v1.23.
Thank you.

Napsty · 2022-09-09T05:49:45Z

Background story: ComponentStatus health checks have been deprecated since Kubernetes 1.19:
kubernetes/enhancements#553
kubernetes/kubernetes#93570
rancher/rancher#11496

Older Rancher versions (using the old but nicer User Interface) heavily relied on these ComponentStatus values:
rancher/rancher#11496

See also https://www.claudiokuenzler.com/blog/1049/rancher2-kubernetes-cluster-errors-alerts-controller-manager-scheduler-deep-dive for a deeper dive into these cluster components. When I wrote that article back in February 2021, I already knew that this would potentially have an impact on the monitoring check:

Depending on how this is implemented on Rancher's side, will also affect the check_rancher2 monitoring plugin.
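For anyone hitting the same error: one way to make such a jq filter tolerate a missing componentStatuses key is jq's `?` operator, which yields no output instead of failing on null. This is a minimal sketch under that assumption and not necessarily how the plugin handles it; the function name and the sample JSON are hypothetical.

```shell
#!/usr/bin/env bash
# Print the number of component conditions whose status is not "True".
# Prints 0 when the response carries no componentStatuses at all.
count_component_errors() {
  echo "$1" \
    | jq -r '.componentStatuses[]? | .conditions[]? | .status' \
    | grep -c -v -x 'True' || true   # grep exits 1 when the count is 0
}

# Made-up sample responses for illustration.
with_status='{"name":"old","componentStatuses":[{"name":"scheduler","conditions":[{"type":"Healthy","status":"True"}]},{"name":"etcd-0","conditions":[{"type":"Healthy","status":"False"}]}]}'
without_status='{"name":"new"}'

echo "with componentStatuses:    $(count_component_errors "$with_status") error(s)"
echo "without componentStatuses: $(count_component_errors "$without_status") error(s)"
```

Note that a cluster without componentStatuses then counts as having zero component errors, so its health would have to be judged from other fields such as the cluster state.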

Napsty · 2022-09-09T07:37:44Z

@kajla can you try with https://raw.githubusercontent.com/Napsty/check_rancher2/issue-35/check_rancher2.sh (branch issue-35) please?

kajla · 2022-09-09T08:55:14Z

Thank you.
It works when a single cluster is specified, but unfortunately it still fails when no cluster is specified.

That's fixed:

$ bash /usr/lib/nagios/plugins/check_rancher2-1.9.1.sh -H rancher.mgmt -P yyyyyyyyyyyyyyyyyyy -S -U token-xxxxx -t cluster -c c-nvvh7
CHECK_RANCHER2 OK - Cluster tst-avsz-rke (v1.23.10+rke2r1) is healthy|'cluster_healthy'=1;;;; 'component_errors'=0;;;; 'cpu'=29280;;;;52000 'memory'=29114536832B;;;0;102540578816 'pods'=185;;;;880 'usage_cpu'=56%;;;0;100 'usage_memory'=28%;;;0;100 'usage_pods'=21%;;;0;100

That's unfixed:

$ bash /usr/lib/nagios/plugins/check_rancher2-1.9.1.sh -H rancher.mgmt -P yyyyyyyyyyyyyyyyyy -S -U token-xxxx -t cluster
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
jq: error (at <stdin>:1): Cannot iterate over null (null)
CHECK_RANCHER2 OK - All clusters (6) are healthy|'clusters_total'=6;;;; 'clusters_errors'=0;;;;

I will also send the debug output of this.
Thank you in advance.

Napsty · 2022-09-09T09:00:23Z

Just made another push, please refresh and download https://raw.githubusercontent.com/Napsty/check_rancher2/issue-35/check_rancher2.sh again. The fix was missing in the multi-cluster check.
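To illustrate why the multi-cluster path needs the same guard, here is a sketch that counts unhealthy component conditions per cluster in one jq call. It assumes the cluster list is a collection object with a data array, and it is not the code from the issue-35 branch; the function name and the sample JSON are invented for the example.

```shell
#!/usr/bin/env bash
# Print "<cluster name><TAB><unhealthy condition count>" for every cluster.
# (.componentStatuses // []) substitutes an empty list when the key is missing.
component_errors_per_cluster() {
  echo "$1" | jq -r '
    .data[]
    | [ .name,
        ( [ (.componentStatuses // [])[] | .conditions[]? | select(.status != "True") ] | length )
      ]
    | @tsv'
}

# Made-up list response: one cluster with componentStatuses, one without.
clusters='{"data":[{"name":"old-cluster","componentStatuses":[{"name":"scheduler","conditions":[{"type":"Healthy","status":"False"}]}]},{"name":"new-cluster"}]}'

component_errors_per_cluster "$clusters"
```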

kajla · 2022-09-09T09:38:16Z

It's good now, thanks :)

Napsty · 2022-09-09T09:43:58Z

Thanks for testing and reporting!

kajla · 2022-09-09T10:03:23Z

You're welcome. I think it can be closed :)

Napsty self-assigned this Sep 9, 2022

Napsty added the bug label Sep 9, 2022

Napsty added a commit that referenced this issue Sep 9, 2022

Fix ComponentStatus (#35), show K8s version in single cluster check

2efa3b0

Napsty added this to the 1.10.0 milestone Sep 9, 2022

Napsty mentioned this issue Sep 9, 2022

PR for 1.10.0 #36

Merged

Napsty added a commit that referenced this issue Sep 9, 2022

Fix ComponentStatus (#35)

14b4aa5

kajla closed this as completed Sep 9, 2022

Napsty added a commit that referenced this issue Sep 9, 2022

PR for 1.10.0 (#36)

a3bcdd3

* Fix ComponentStatus (#35), show K8s version in single cluster check
* Ignoring statuses in workload check
* Fix ComponentStatus (#35)
