Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSD-8760: Improve bulk service log error handling #168

Merged
merged 1 commit into from
Dec 14, 2021

Conversation

supreeth7
Copy link
Contributor

Resolves: https://issues.redhat.com/browse/OSD-8760
Relates to: https://issues.redhat.com/browse/OSD-7691

This PR improves error handling when sending a service log to multiple clusters, if a service log failed to post to a single cluster it doesn't halt the command execution and continues to post service log updates to the remaining clusters.

This PR also refactors the service log post sub-command output.


When run servicelog post sub-command to post service logs to multiple clusters, the command output will list the clusters the service log was successfully posted along with a list of clusters which encountered errors while sending the service log.

$ osdctl servicelog post ...

INFO[0008] Success: 2, Failed: 1
                       
INFO[0008] Successful clusters:                         
ID                   Status
<clusterID>          Message has been successfully sent to <clusterID>
<clusterID>          Message has been successfully sent to <clusterID>

INFO[0008] Failed clusters:                             
ID                   Status
<clusterID>          Message sent, but wrong severity information was passed (wanted "high", got "Error")

@fahlmant
Copy link
Contributor

fahlmant commented Dec 7, 2021

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 7, 2021
@supreeth7 supreeth7 changed the title OSD-8760: Improve bulk service log error handling WIP - OSD-8760: Improve bulk service log error handling Dec 8, 2021
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2021
@supreeth7
Copy link
Contributor Author

supreeth7 commented Dec 8, 2021

It has been noticed that a service log is being sent to the same cluster multiple times, will work on a fix.

@supreeth7 supreeth7 changed the title WIP - OSD-8760: Improve bulk service log error handling OSD-8760: Improve bulk service log error handling Dec 8, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2021
@fahlmant
Copy link
Contributor

fahlmant commented Dec 8, 2021

@supreeth7 what was the fix?

@supreeth7
Copy link
Contributor Author

supreeth7 commented Dec 8, 2021

@fahlmant It was a wrong pointer reference, a one-liner :)

@fahlmant
Copy link
Contributor

fahlmant commented Dec 8, 2021

@supreeth7 Awesome. Do you have any tests against stage clusters I can look at?

@supreeth7
Copy link
Contributor Author

supreeth7 commented Dec 8, 2021

@fahlmant I tested them on two stage clusters which I had created prior, here are the results:

./dist/osdctl_linux_amd64/osdctl servicelog post -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/cluster_has_gone_missing.json -p "CLUSTER_UUID=shepherd%"
INFO[0002] The following clusters match the given parameters: 
Name                ID                                 State               Version             Cloud Provider      Region
shepherd-1          1oudlp5pmjparuqfasf16pq8jtbtnptq   ready               4.9.8               aws                 ap-south-1
shepherd-2          1outpcau2bum4mtmkgj1m0q51r4qdunr   ready               4.9.9               aws                 ap-south-1

INFO[0002] The following template will be sent:         
{
  "severity": "Error",
  "service_name": "SREManualAction",
  "cluster_uuid": "${CLUSTER_UUID}",
  "summary": "Action required: cluster not checking in",
  "description": "Your cluster requires you to take action because it is no longer checking in with Red Hat OpenShift Cluster Manager. Possible causes include stopping instances or a networking misconfiguration. If you have stopped the cluster instances, please start them again - stopping instances is not supported. If you intended to terminate this cluster then please delete the cluster in the Red Hat console. Otherwise file a support request.",
  "internal_only": false,
  "event_stream_id": ""
}
Continue? (y/N): y
INFO[0016] Success: 2, Failed: 0
                       
INFO[0016] Successful clusters:                         
ID                                     Status
751e4669-857f-45ce-804f-ece43abbe91c   Message has been successfully sent to 751e4669-857f-45ce-804f-ece43abbe91c
93f72b85-e317-432e-8355-46885c38f68f   Message has been successfully sent to 93f72b85-e317-432e-8355-46885c38f68f

If there is an error in one of the clusters (as an example, here a cluster was in installing state):

❯ ./dist/osdctl_linux_amd64/osdctl servicelog post -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/cluster_has_gone_missing.json -p "CLUSTER_UUID=shepherd%"
INFO[0001] The following clusters match the given parameters: 
Name                ID                                 State               Version             Cloud Provider      Region
shepherd-1          1oudlp5pmjparuqfasf16pq8jtbtnptq   ready               4.9.8               aws                 ap-south-1
shepherd-2          1outpcau2bum4mtmkgj1m0q51r4qdunr   installing          4.9.9               aws                 ap-south-1

INFO[0001] The following template will be sent:         
{
  "severity": "Error",
  "service_name": "SREManualAction",
  "cluster_uuid": "${CLUSTER_UUID}",
  "summary": "Action required: cluster not checking in",
  "description": "Your cluster requires you to take action because it is no longer checking in with Red Hat OpenShift Cluster Manager. Possible causes include stopping instances or a networking misconfiguration. If you have stopped the cluster instances, please start them again - stopping instances is not supported. If you intended to terminate this cluster then please delete the cluster in the Red Hat console. Otherwise file a support request.",
  "internal_only": false,
  "event_stream_id": ""
}
Continue? (y/N): y
INFO[0003] Success: 1, Failed: 1
                       
INFO[0003] Successful clusters:                         
ID                                     Status
93f72b85-e317-432e-8355-46885c38f68f   Message has been successfully sent to 93f72b85-e317-432e-8355-46885c38f68f

INFO[0003] Failed clusters:                             
ID                                     Status
751e4669-857f-45ce-804f-ece43abbe91c   Account sbasabat.openshift denied access to perform create on ServiceLog with HTTP call POST /api/service_logs/v1/cluster_logs

I can confirm that the messaged clusters have service logs listed via osdctl list and console.

Refactored servicelog list for better error handling

minor refactor

fixed multiple SL post to the same cluster

comment changes, list error handling

Refactor: validation functions return errors, eliminated use of global var
@supreeth7
Copy link
Contributor Author

/retest

@wanghaoran1988
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2021
@wanghaoran1988
Copy link
Member

@clcollins @fahlmant @georgettica @iamkirkbater Cloud one of you take another look and approve this?

@georgettica
Copy link
Contributor

/approve
looks good overall.
I do still need to test the normal post command to make sure that's ok
/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 14, 2021
@georgettica
Copy link
Contributor

/approve
works :)

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 14, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: georgettica, supreeth7, wanghaoran1988

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@georgettica
Copy link
Contributor

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 14, 2021

@supreeth7: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 040133d into openshift:master Dec 14, 2021
devppratik pushed a commit to devppratik/osdctl that referenced this pull request Aug 23, 2023
Refactored servicelog list for better error handling

minor refactor

fixed multiple SL post to the same cluster

comment changes, list error handling

Refactor: validation functions return errors, eliminated use of global var
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants