OADP-6294: Mod-work for the OADP Troubleshooting user story #95005
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,99 +1,87 @@ | ||
// Module included in the following assemblies: | ||
// | ||
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc | ||
// * backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc | ||
// * migrating_from_ocp_3_to_4/troubleshooting-3-4.adoc | ||
// * migration_toolkit_for_containers/troubleshooting-mtc | ||
|
||
[id="migration-debugging-velero-resources_{context}"] | ||
= Debugging Velero resources with the Velero CLI tool | ||
|
||
You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool. | ||
You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool. The Velero CLI tool provides more detailed information than the OpenShift CLI tool. | ||
|
||
The Velero CLI tool provides more detailed information than the OpenShift CLI tool. | ||
|
||
[discrete] | ||
[id="velero-command-syntax_{context}"] | ||
== Syntax | ||
|
||
Use the `oc exec` command to run a Velero CLI command: | ||
.Procedure | ||
|
||
* Use the `oc exec` command to run a Velero CLI command: | ||
+ | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
<backup_restore_cr> <command> <cr_name> | ||
---- | ||
|
||
.Example | ||
+ | ||
.Example for the `oc exec` command | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql | ||
---- | ||
|
||
[discrete] | ||
[id="velero-help-option_{context}"] | ||
== Help option | ||
|
||
Use the `velero --help` option to list all Velero CLI commands: | ||
|
||
* List all Velero CLI commands by using the following `velero --help` option: | ||
+ | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
--help | ||
---- | ||
|
||
|
||
[discrete] | ||
[id="velero-describe-command_{context}"] | ||
== Describe command | ||
|
||
Use the `velero describe` command to retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR: | ||
|
||
* Retrieve the logs of a `Backup` or `Restore` CR by using the following `velero logs` command: | ||
+ | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
<backup_restore_cr> describe <cr_name> | ||
<backup_restore_cr> logs <cr_name> | ||
---- | ||
|
||
.Example | ||
+ | ||
.Example for the `velero logs` command | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql | ||
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf | ||
---- | ||
|
||
The following types of restore errors and warnings are shown in the output of a `velero describe` request: | ||
|
||
* `Velero`: A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on | ||
* `Cluster`: A list of messages related to backing up or restoring cluster-scoped resources | ||
* `Namespaces`: A list of messages related to backing up or restoring resources stored in namespaces | ||
|
||
One or more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status. | ||
|
||
[IMPORTANT] | ||
==== | ||
* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check to see if the resource is actually in the cluster. | ||
|
||
* If there are `Velero` errors, but no resource-specific errors, in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads, but carefully validate post-restore applications. | ||
* Retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR by using the following `velero describe` command: | ||
+ | ||
For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored. | ||
==== | ||
|
||
[discrete] | ||
[id="velero-logs-command_{context}"] | ||
== Logs command | ||
|
||
Use the `velero logs` command to retrieve the logs of a `Backup` or `Restore` CR: | ||
|
||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
<backup_restore_cr> logs <cr_name> | ||
<backup_restore_cr> describe <cr_name> | ||
---- | ||
|
||
.Example | ||
+ | ||
.Example for the `velero describe` command | ||
[source,terminal,subs="attributes+"] | ||
---- | ||
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ | ||
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf | ||
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql | ||
---- | ||
+ | ||
The following types of restore errors and warnings are shown in the output of a `velero describe` request: | ||
+ | ||
.`Velero` | ||
A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on | ||
+ | ||
.`Cluster` | ||
A list of messages related to backing up or restoring cluster-scoped resources | ||
+ | ||
.`Namespaces` | ||
A list of messages related to backing up or restoring resources stored in namespaces | ||
|
||
+ | ||
One or more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status. | ||
+ | ||
Consider the following points for these restore errors: | ||
|
||
* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check if the resource is actually in the cluster. | ||
|
||
* If there are `Velero` errors, but no resource-specific errors, in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads, but carefully validate post-restore applications. | ||
+ | ||
For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored. |
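
The command patterns in this module can be wrapped in a small shell helper. The following is a minimal sketch and not part of the PR: the function names `velero_cli` and `not_completed`, the `DRY_RUN` and `NAMESPACE` variables, and the `openshift-adp` default are assumptions made for illustration. The usage comments also assume that `PodVolumeRestore` and `DataDownload` objects report their state in `.status.phase`.

```shell
# Hypothetical helper (not part of the module): run a Velero CLI command
# inside the velero deployment, following the `oc exec` syntax shown above.
# NAMESPACE defaults to openshift-adp, which is an assumption; set it to
# the namespace where OADP is installed.
velero_cli() {
  ns="${NAMESPACE:-openshift-adp}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, so it can be reviewed first.
    echo "oc -n $ns exec deployment/velero -c velero -- ./velero $*"
  else
    oc -n "$ns" exec deployment/velero -c velero -- ./velero "$@"
  fi
}

# Hypothetical filter: read "<name> <phase>" lines on stdin and print only
# the entries whose phase is not Completed (for example Failed or InProgress).
not_completed() {
  awk '$2 != "Completed" { print $1 " " $2 }'
}

# Possible usage (requires a logged-in `oc` session):
#   velero_cli restore describe <cr_name> --details
#   oc -n openshift-adp get podvolumerestores.velero.io \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
#     | not_completed
```

If `not_completed` prints nothing for both `PodVolumeRestores` and `DataDownloads`, that matches the case described above in which volume data might have been fully restored. Validating the restored applications is still advisable.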
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,16 +1,16 @@ | ||||||
// Module included in the following assemblies: | ||||||
// | ||||||
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc | ||||||
// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc | ||||||
|
||||||
:_mod-docs-content-type: PROCEDURE | ||||||
[id="creating-alerting-rules_{context}"] | ||||||
= Creating an alerting rule | ||||||
|
||||||
The {product-title} monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics which are scraped with the user workload monitoring. | ||||||
The {product-title} monitoring stack allows you to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics, which are scraped with the user workload monitoring. | ||||||
Maybe change: To: Up to you!
Suggested change
|
||||||
|
||||||
.Procedure | ||||||
|
||||||
. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`. | ||||||
. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`: | ||||||
+ | ||||||
.Sample `OADPBackupFailing` alert | ||||||
[source,yaml] | ||||||
|
@@ -40,7 +40,7 @@ In this sample, the Alert displays under the following conditions: | |||||
+ | ||||||
* There is an increase of new failing backups during the 2 last hours that is greater than 0 and the state persists for at least 5 minutes. | ||||||
* If the time of the first increase is less than 5 minutes, the Alert will be in a `Pending` state, after which it will turn into a `Firing` state. | ||||||
+ | ||||||
|
||||||
. Apply the `4_create_oadp_alert_rule.yaml` file, which creates the `PrometheusRule` object in the `openshift-adp` namespace: | ||||||
+ | ||||||
[source,terminal] | ||||||
|
@@ -55,12 +55,11 @@ prometheusrule.monitoring.coreos.com/sample-oadp-alert created | |||||
---- | ||||||
|
||||||
.Verification | ||||||
|
||||||
* After the Alert is triggered, you can view it in the following ways: | ||||||
** In the *Developer* perspective, select the *Observe* menu. | ||||||
** In the *Administrator* perspective under the *Observe* -> *Alerting* menu, select *User* in the *Filter* box. Otherwise, by default only the *Platform* Alerts are displayed. | ||||||
+ | ||||||
.OADP backup failing alert | ||||||
|
||||||
image::oadp-backup-failing-alert.png[OADP backup failing alert] | ||||||
|
||||||
|
||||||
image::oadp-backup-failing-alert.png[OADP backup failing alert] |
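
The `Pending` and `Firing` conditions described for the sample alert can be illustrated with a small shell function. This is a simplified sketch, not how the monitoring stack is implemented: the function name `alert_state` and its arguments are hypothetical, values are treated as whole numbers, and the 300-second default stands in for the 5-minute window mentioned in the module.

```shell
# Hypothetical illustration of the alert states described in the module.
# Arguments:
#   $1  increase in failing backups over the last 2 hours (whole number)
#   $2  seconds for which that increase has been greater than 0
#   $3  hold time in seconds before the alert fires (default 300 = 5 minutes)
alert_state() {
  increase="$1"
  active_seconds="$2"
  hold_seconds="${3:-300}"
  if [ "$increase" -le 0 ]; then
    # The condition is not met, so no alert is raised.
    echo "Inactive"
  elif [ "$active_seconds" -lt "$hold_seconds" ]; then
    # The condition is met, but not yet for the full hold time.
    echo "Pending"
  else
    # The condition has persisted for at least the hold time.
    echo "Firing"
  fi
}

# Example calls:
#   alert_state 0 0      # no new failing backups
#   alert_state 1 120    # failing for 2 minutes
#   alert_state 2 300    # failing for 5 minutes
```

The actual evaluation is performed by the monitoring stack against the `PrometheusRule` object; the function only mirrors the two bullet points about the sample alert.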
I would break this into at least two sentences. It will make the content easier to understand. Also, I was a tad confused by "but no resource-specific errors, in the output of a `describe` command," The comma in-between threw me off.