OSD-8760: Improve bulk service log error handling #168

supreeth7 · 2021-12-06T05:52:19Z

Resolves: https://issues.redhat.com/browse/OSD-8760
Relates to: https://issues.redhat.com/browse/OSD-7691

This PR improves error handling when sending a service log to multiple clusters. If a service log fails to post to one cluster, the command no longer halts; it continues posting the service log to the remaining clusters.

This PR also refactors the service log post sub-command output.

When the servicelog post sub-command is run against multiple clusters, the output lists the clusters to which the service log was successfully posted, followed by the clusters that encountered errors while sending the service log.

$ osdctl servicelog post ...

INFO[0008] Success: 2, Failed: 1
                       
INFO[0008] Successful clusters:                         
ID                   Status
<clusterID>          Message has been successfully sent to <clusterID>
<clusterID>          Message has been successfully sent to <clusterID>

INFO[0008] Failed clusters:                             
ID                   Status
<clusterID>          Message sent, but wrong severity information was passed (wanted "high", got "Error")

fahlmant · 2021-12-07T13:36:09Z

/label tide/merge-method-squash

supreeth7 · 2021-12-08T04:05:58Z

I noticed that a service log is being sent to the same cluster multiple times; I will work on a fix.

fahlmant · 2021-12-08T14:46:35Z

@supreeth7 what was the fix?

supreeth7 · 2021-12-08T14:50:23Z

@fahlmant It was a wrong pointer reference, a one-liner :)
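The thread does not show the one-line fix itself. As a generic illustration only, here is one common way a wrong pointer reference in Go can make every stored entry describe the same item, which would produce duplicate sends. The `cluster` type and both functions are hypothetical and are not taken from this PR.

```go
package main

import "fmt"

type cluster struct{ ID string }

// aliased reuses one variable, so every stored pointer refers to the same
// memory and ends up describing the last cluster copied into it.
func aliased(clusters []cluster) []*cluster {
	var current cluster
	var out []*cluster
	for i := range clusters {
		current = clusters[i]
		out = append(out, &current) // bug: same address on every iteration
	}
	return out
}

// distinct points at each slice element, so every pointer is different.
func distinct(clusters []cluster) []*cluster {
	var out []*cluster
	for i := range clusters {
		out = append(out, &clusters[i])
	}
	return out
}

func main() {
	cs := []cluster{{ID: "a"}, {ID: "b"}}
	for _, p := range aliased(cs) {
		fmt.Println("aliased:", p.ID) // prints "b" both times
	}
	for _, p := range distinct(cs) {
		fmt.Println("distinct:", p.ID) // prints "a", then "b"
	}
}
```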

fahlmant · 2021-12-08T16:14:56Z

@supreeth7 Awesome. Do you have any tests against stage clusters I can look at?

supreeth7 · 2021-12-08T17:02:15Z

@fahlmant I tested on two stage clusters that I had created earlier; here are the results:

./dist/osdctl_linux_amd64/osdctl servicelog post -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/cluster_has_gone_missing.json -p "CLUSTER_UUID=shepherd%"
INFO[0002] The following clusters match the given parameters: 
Name                ID                                 State               Version             Cloud Provider      Region
shepherd-1          1oudlp5pmjparuqfasf16pq8jtbtnptq   ready               4.9.8               aws                 ap-south-1
shepherd-2          1outpcau2bum4mtmkgj1m0q51r4qdunr   ready               4.9.9               aws                 ap-south-1

INFO[0002] The following template will be sent:         
{
  "severity": "Error",
  "service_name": "SREManualAction",
  "cluster_uuid": "${CLUSTER_UUID}",
  "summary": "Action required: cluster not checking in",
  "description": "Your cluster requires you to take action because it is no longer checking in with Red Hat OpenShift Cluster Manager. Possible causes include stopping instances or a networking misconfiguration. If you have stopped the cluster instances, please start them again - stopping instances is not supported. If you intended to terminate this cluster then please delete the cluster in the Red Hat console. Otherwise file a support request.",
  "internal_only": false,
  "event_stream_id": ""
}
Continue? (y/N): y
INFO[0016] Success: 2, Failed: 0
                       
INFO[0016] Successful clusters:                         
ID                                     Status
751e4669-857f-45ce-804f-ece43abbe91c   Message has been successfully sent to 751e4669-857f-45ce-804f-ece43abbe91c
93f72b85-e317-432e-8355-46885c38f68f   Message has been successfully sent to 93f72b85-e317-432e-8355-46885c38f68f

If there is an error in one of the clusters (as an example, here a cluster was in installing state):

❯ ./dist/osdctl_linux_amd64/osdctl servicelog post -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/cluster_has_gone_missing.json -p "CLUSTER_UUID=shepherd%"
INFO[0001] The following clusters match the given parameters: 
Name                ID                                 State               Version             Cloud Provider      Region
shepherd-1          1oudlp5pmjparuqfasf16pq8jtbtnptq   ready               4.9.8               aws                 ap-south-1
shepherd-2          1outpcau2bum4mtmkgj1m0q51r4qdunr   installing          4.9.9               aws                 ap-south-1

INFO[0001] The following template will be sent:         
{
  "severity": "Error",
  "service_name": "SREManualAction",
  "cluster_uuid": "${CLUSTER_UUID}",
  "summary": "Action required: cluster not checking in",
  "description": "Your cluster requires you to take action because it is no longer checking in with Red Hat OpenShift Cluster Manager. Possible causes include stopping instances or a networking misconfiguration. If you have stopped the cluster instances, please start them again - stopping instances is not supported. If you intended to terminate this cluster then please delete the cluster in the Red Hat console. Otherwise file a support request.",
  "internal_only": false,
  "event_stream_id": ""
}
Continue? (y/N): y
INFO[0003] Success: 1, Failed: 1
                       
INFO[0003] Successful clusters:                         
ID                                     Status
93f72b85-e317-432e-8355-46885c38f68f   Message has been successfully sent to 93f72b85-e317-432e-8355-46885c38f68f

INFO[0003] Failed clusters:                             
ID                                     Status
751e4669-857f-45ce-804f-ece43abbe91c   Account sbasabat.openshift denied access to perform create on ServiceLog with HTTP call POST /api/service_logs/v1/cluster_logs

I can confirm that the messaged clusters have service logs listed via osdctl list and console.

cmd/servicelog/common.go

Refactored servicelog list for better error handling
minor refactor
fixed multiple SL post to the same cluster
comment changes, list error handling
Refactor: validation functions return errors, eliminated use of global var
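The last item in the commit message, validation functions that return errors rather than relying on a global variable, might look something like the sketch below. The `message` type and `validateResponse` function are hypothetical and only modelled on the severity error shown in the sample output; the actual code in cmd/servicelog/common.go may differ.

```go
package main

import "fmt"

// message holds the fields compared in this sketch (hypothetical type).
type message struct {
	Severity    string
	ClusterUUID string
}

// validateResponse returns an error describing the first mismatch, so the
// caller decides how to report it instead of reading shared global state.
func validateResponse(sent, got message) error {
	if got.Severity != sent.Severity {
		return fmt.Errorf("wrong severity information was passed (wanted %q, got %q)",
			sent.Severity, got.Severity)
	}
	if got.ClusterUUID != sent.ClusterUUID {
		return fmt.Errorf("wrong cluster UUID was passed (wanted %q, got %q)",
			sent.ClusterUUID, got.ClusterUUID)
	}
	return nil
}

func main() {
	sent := message{Severity: "high", ClusterUUID: "example-uuid"}
	got := message{Severity: "Error", ClusterUUID: "example-uuid"}
	if err := validateResponse(sent, got); err != nil {
		fmt.Println("Message sent, but", err)
	}
}
```

Returning the error lets the caller file the cluster under the failed list and carry on, which fits the per-cluster reporting this PR introduces.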

supreeth7 · 2021-12-10T10:28:30Z

/retest

wanghaoran1988 · 2021-12-14T04:28:00Z

/lgtm

wanghaoran1988 · 2021-12-14T04:29:37Z

@clcollins @fahlmant @georgettica @iamkirkbater Could one of you take another look and approve this?

georgettica · 2021-12-14T10:05:59Z

/approve
looks good overall.
I do still need to test the normal post command to make sure that's ok
/hold

georgettica · 2021-12-14T11:41:40Z

/approve
works :)

openshift-ci · 2021-12-14T11:43:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: georgettica, supreeth7, wanghaoran1988

Needs approval from an approver in each of these files:

~~OWNERS~~ [georgettica]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

georgettica · 2021-12-14T13:08:13Z

/unhold

openshift-ci · 2021-12-14T13:13:26Z

@supreeth7: all tests passed!

openshift-ci bot requested review from clcollins and sam-nguyen7 December 6, 2021 05:52

openshift-ci bot added the tide/merge-method-squash label Dec 7, 2021

supreeth7 changed the title ~~OSD-8760: Improve bulk service log error handling~~ WIP - OSD-8760: Improve bulk service log error handling Dec 8, 2021

openshift-ci bot added the do-not-merge/work-in-progress label Dec 8, 2021

supreeth7 changed the title ~~WIP - OSD-8760: Improve bulk service log error handling~~ OSD-8760: Improve bulk service log error handling Dec 8, 2021

openshift-ci bot removed the do-not-merge/work-in-progress label Dec 8, 2021

supreeth7 force-pushed the OSD-8760 branch from 90c6984 to 015b20c December 8, 2021 14:13

wanghaoran1988 left four reviews on cmd/servicelog/common.go Dec 10, 2021 (comments now outdated and resolved)

Improved bulk service log error handling

8b89e88

supreeth7 force-pushed the OSD-8760 branch from 015b20c to 8b89e88 December 10, 2021 10:28

openshift-ci bot assigned wanghaoran1988 Dec 14, 2021

openshift-ci bot added the lgtm label Dec 14, 2021

openshift-ci bot added the do-not-merge/hold and approved labels Dec 14, 2021

openshift-ci bot removed the do-not-merge/hold label Dec 14, 2021

openshift-merge-robot merged commit 040133d into openshift:master Dec 14, 2021
