OADP-6294: Mod-work for the OADP Troubleshooting user story #95005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
@@ -9,89 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

You might encounter these common issues with `Backup` and `Restore` custom resources (CRs).
You might encounter the following common issues with `Backup` and `Restore` custom resources (CRs):

[id="backup-cannot-retrieve-volume_{context}"]
== Backup CR cannot retrieve volume
* Backup CR cannot retrieve volume
* Backup CR status remains in progress
* Backup CR status remains in PartiallyFailed

The `Backup` CR displays the following error message: `InvalidVolume.NotFound: The volume 'vol-xxxx' does not exist`.
include::modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc[leveloffset=+1]

.Cause
include::modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc[leveloffset=+1]

The persistent volume (PV) and the snapshot locations are in different regions.

.Solution

. Edit the value of the `spec.snapshotLocations.velero.config.region` key in the `DataProtectionApplication` manifest so that the snapshot location is in the same region as the PV.
. Create a new `Backup` CR.
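The following is a minimal sketch of the relevant fields in the `DataProtectionApplication` manifest; the provider and region values are illustrative assumptions, not values from this procedure:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  snapshotLocations:
  - velero:
      provider: aws  # assumed provider for illustration
      config:
        region: us-east-1  # must match the region of the PV
----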

[id="backup-cr-remains-in-progress_{context}"]
== Backup CR status remains in progress

The status of a `Backup` CR remains in the `InProgress` phase and does not complete.

.Cause

If a backup is interrupted, it cannot be resumed.

.Solution

. Retrieve the details of the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
backup describe <backup>
----

. Delete the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc delete backups.velero.io <backup> -n openshift-adp
----
+
You do not need to clean up the backup location because an in-progress `Backup` CR has not uploaded files to object storage.

. Create a new `Backup` CR, for example as sketched after this procedure.

. View the Velero backup details by running the following command:
+
[source,terminal, subs="+quotes"]
----
$ velero backup describe _<backup-name>_ --details
----
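A minimal sketch of the new `Backup` CR from step 3, assuming a hypothetical backup name, namespace, and storage location:

[source,yaml]
----
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: <backup_name>
  namespace: openshift-adp
spec:
  includedNamespaces:
  - <namespace_to_back_up>  # hypothetical namespace to back up
  storageLocation: <backup_storage_location_name>  # hypothetical BackupStorageLocation name
----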

[id="backup-cr-remains-partiallyfailed_{context}"]
== Backup CR status remains in PartiallyFailed

The status of a `Backup` CR that does not use Restic remains in the `PartiallyFailed` phase and does not complete. A snapshot of the affiliated PVC is not created.

.Cause

If the `VolumeSnapshotClass` that the backup uses is missing the required label, the CSI snapshot plugin fails to create a snapshot. As a result, the `Velero` pod logs an error similar to the following message:

[source,text]
----
time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
----

.Solution

. Delete the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc delete backups.velero.io <backup> -n openshift-adp
----

. If required, clean up the stored data on the `BackupStorageLocation` to free up space.

. Apply the label `velero.io/csi-volumesnapshot-class=true` to the `VolumeSnapshotClass` object by running the following command (a sketch of the labeled object follows this procedure):
+
[source,terminal]
----
$ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
----

. Create a new `Backup` CR.
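For reference, a labeled `VolumeSnapshotClass` object might look like the following sketch; the driver is taken from the error message above, and the `deletionPolicy` value is an illustrative assumption:

[source,yaml]
----
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: <snapclass_name>
  labels:
    velero.io/csi-volumesnapshot-class: "true"  # label required by the Velero CSI plugin
driver: openshift-storage.rbd.csi.ceph.com  # driver from the error log above
deletionPolicy: Retain  # assumed value for illustration
----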
include::modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc[leveloffset=+1]
@@ -11,9 +11,8 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.
If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources. The values for the resource request fields must follow the same format as Kubernetes resource requirements.

The values for the resource request fields must follow the same format as Kubernetes resource requirements.
If you do not specify `configuration.velero.podConfig.resourceAllocations` or `configuration.restic.podConfig.resourceAllocations`, see the following default `resources` specification for a Velero or Restic pod:

[source,yaml]
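A sketch of how explicit resource requests might be set in the `DataProtectionApplication` manifest; the CPU and memory values are illustrative assumptions:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:  # same format as Kubernetes resource requirements
          requests:
            cpu: 500m      # assumed value
            memory: 256Mi  # assumed value
----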
@@ -9,82 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

You might encounter these issues when you back up applications with Restic.
You might encounter the following issues when you back up applications with Restic:

[id="restic-permission-error-nfs-root-squash-enabled_{context}"]
== Restic permission error for NFS data volumes with root_squash enabled
* Restic permission error for NFS data volumes with `root_squash` enabled
* Restic Backup CR cannot be recreated after bucket is emptied
* Restic restore partially failing on OCP 4.14 due to changed PSA policy

The `Restic` pod log displays the following error message: `controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied"`.
include::modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc[leveloffset=+1]

.Cause

If your NFS data volumes have `root_squash` enabled, `Restic` maps to `nfsnobody` and does not have permission to create backups.

.Solution

You can resolve this issue by creating a supplemental group for `Restic` and adding the group ID to the `DataProtectionApplication` manifest:

. Create a supplemental group for `Restic` on the NFS data volume.
. Set the `setgid` bit on the NFS directories so that group ownership is inherited (see the sketch after this procedure).
. Add the `spec.configuration.nodeAgent.supplementalGroups` parameter and the group ID to the `DataProtectionApplication` manifest, as shown in the following example:
+
[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: restic
      supplementalGroups:
      - <group_id> <1>
# ...
----
<1> Specify the supplemental group ID.

. Wait for the `Restic` pods to restart so that the changes are applied.
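For steps 1 and 2, a sketch of the commands to run on the NFS server; the group name and export path are hypothetical:

[source,terminal]
----
$ sudo groupadd -g <group_id> restic-nfs   # create the supplemental group
$ sudo chgrp <group_id> /srv/nfs/export    # hypothetical export path
$ sudo chmod g+s /srv/nfs/export           # set the setgid bit so group ownership is inherited
----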

[id="restic-backup-cannot-be-recreated-after-s3-bucket-emptied_{context}"]
== Restic Backup CR cannot be recreated after bucket is emptied

If you create a Restic `Backup` CR for a namespace, empty the object storage bucket, and then recreate the `Backup` CR for the same namespace, the recreated `Backup` CR fails.

The `velero` pod log displays the following error message: `stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?`.

.Cause

Velero does not recreate or update the Restic repository from the `ResticRepository` manifest if the Restic directories are deleted from object storage. See link:https://github.com/vmware-tanzu/velero/issues/4421[Velero issue 4421] for more information.

.Solution

* Remove the related Restic repository from the namespace by running the following command:
+
[source,terminal]
----
$ oc delete resticrepository <name_of_the_restic_repository> -n openshift-adp
----
+

In the following error log, `mysql-persistent` is the problematic Restic repository. The name of the repository appears in italics for clarity.
+
[source,text,options="nowrap",subs="+quotes,verbatim"]
----
time="2021-12-29T18:29:14Z" level=info msg="1 errors
encountered backup up item" backup=velero/backup65
logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
backup=velero/backup65 error="pod volume backup failed: error running
restic backup, stderr=Fatal: unable to open config file: Stat: The
specified key does not exist.\nIs there a repository at the following
location?\ns3:http://minio-minio.apps.mayap-oadp-
veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
restic/_mysql-persistent_\n: exit status 1" error.file="/remote-source/
src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
error.function="github.com/vmware-tanzu/velero/
pkg/restic.(*backupper).BackupPodVolumes"
logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
----
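To find the name of the problematic Restic repository before deleting it, you can list the `ResticRepository` resources in the namespace, as in the following sketch:

[source,terminal]
----
$ oc get resticrepositories.velero.io -n openshift-adp
----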
include::modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc[leveloffset=+1]

include::modules/oadp-restic-restore-failing-psa-policy.adoc[leveloffset=+1]
96 changes: 42 additions & 54 deletions modules/migration-debugging-velero-resources.adoc
@@ -1,99 +1,87 @@
// Module included in the following assemblies:
//
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
// * backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc
// * migrating_from_ocp_3_to_4/troubleshooting-3-4.adoc
// * migration_toolkit_for_containers/troubleshooting-mtc

[id="migration-debugging-velero-resources_{context}"]
= Debugging Velero resources with the Velero CLI tool

You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool.
You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool. The Velero CLI tool provides more detailed information than the OpenShift CLI tool.

The Velero CLI tool provides more detailed information than the OpenShift CLI tool.

[discrete]
[id="velero-command-syntax_{context}"]
== Syntax

Use the `oc exec` command to run a Velero CLI command:
.Procedure

* Use the `oc exec` command to run a Velero CLI command:
+
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> <command> <cr_name>
----

.Example
+
.Example for the `oc exec` command
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
----

[discrete]
[id="velero-help-option_{context}"]
== Help option

Use the `velero --help` option to list all Velero CLI commands:

* List all Velero CLI commands by using the following `velero --help` option:
+
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
--help
----


[discrete]
[id="velero-describe-command_{context}"]
== Describe command

Use the `velero describe` command to retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR:

* Retrieve the logs of a `Backup` or `Restore` CR by using the following `velero logs` command:
+
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> describe <cr_name>
<backup_restore_cr> logs <cr_name>
----

.Example
+
.Example for the `velero logs` command
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
----

The following types of restore errors and warnings are shown in the output of a `velero describe` request:

* `Velero`: A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on
* `Cluster`: A list of messages related to backing up or restoring cluster-scoped resources
* `Namespaces`: A list of messages related to backing up or restoring resources stored in namespaces

One or more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status.

[IMPORTANT]
====
* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check to see if the resource is actually in the cluster.

* If there are `Velero` errors, but no resource-specific errors, in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads, but carefully validate post-restore applications.
* Retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR by using the following `velero describe` command:
+
For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored.
====

[discrete]
[id="velero-logs-command_{context}"]
== Logs command

Use the `velero logs` command to retrieve the logs of a `Backup` or `Restore` CR:

[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> logs <cr_name>
<backup_restore_cr> describe <cr_name>
----

.Example
+
.Example for the `velero describe` command
[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
----
+
The following types of restore errors and warnings are shown in the output of a `velero describe` request:
+
.`Velero`
A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on
+
.`Cluster`
A list of messages related to backing up or restoring cluster-scoped resources
+
.`Namespaces`
A list of messages related to backing up or restoring resources stored in namespaces

+
One or more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status.
+
Consider the following points for these restore errors:

* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check if the resource is actually in the cluster.

* If there are `Velero` errors but no resource-specific errors in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads. In this case, carefully validate post-restore applications.
Review comment: I would break this into at least two sentences. It will make the content easier to understand. Also, I was a tad confused by "but no resource-specific errors, in the output of a describe command"; the comma in between threw me off.

+
For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored.
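A sketch of commands for checking those statuses; the resource names follow the Velero custom resource definitions and are assumptions for illustration:

[source,terminal,subs="attributes+"]
----
$ oc -n {namespace} get podvolumerestores
$ oc -n {namespace} get datadownloads
----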
13 changes: 6 additions & 7 deletions modules/oadp-creating-alerting-rule.adoc
@@ -1,16 +1,16 @@
// Module included in the following assemblies:
//
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc

:_mod-docs-content-type: PROCEDURE
[id="creating-alerting-rules_{context}"]
= Creating an alerting rule

The {product-title} monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics which are scraped with the user workload monitoring.
The {product-title} monitoring stack allows you to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics scraped with the user workload monitoring.
Review comment: Maybe change "To create an Alerting rule for the OADP project, use one of the Metrics, which are scraped with the user workload monitoring." to "To create an Alerting rule for the OADP project, use one of the Metrics scraped with the user workload monitoring." Up to you!

Review comment (Contributor), suggested change:
-The {product-title} monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics, which are scraped with the user workload monitoring.
+The {product-title} monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the {oadp-short} project, use one of the Metrics scraped with the user workload monitoring.

.Procedure

. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`.
. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`:
+
.Sample `OADPBackupFailing` alert
[source,yaml]
@@ -40,7 +40,7 @@ In this sample, the Alert displays under the following conditions:
+
* There is an increase of new failing backups during the last 2 hours that is greater than 0 and the state persists for at least 5 minutes.
* If the time since the first increase is less than 5 minutes, the Alert is in a `Pending` state, after which it turns into a `Firing` state.
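+
Based on these conditions, the following is a hedged sketch of what such a `PrometheusRule` might contain; the metric name, job label, and severity are assumptions, not confirmed values from the truncated sample above:
+
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-alert
    rules:
    - alert: OADPBackupFailing
      # assumed metric: fires when failing backups increased in the last 2 hours
      expr: increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning  # assumed value
      annotations:
        description: "OADP backups are failing"
----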
+

. Apply the `4_create_oadp_alert_rule.yaml` file, which creates the `PrometheusRule` object in the `openshift-adp` namespace:
+
[source,terminal]
@@ -55,12 +55,11 @@ prometheusrule.monitoring.coreos.com/sample-oadp-alert created
----

.Verification

* After the Alert is triggered, you can view it in the following ways:
** In the *Developer* perspective, select the *Observe* menu.
** In the *Administrator* perspective under the *Observe* -> *Alerting* menu, select *User* in the *Filter* box. Otherwise, by default only the *Platform* Alerts are displayed.
+
.OADP backup failing alert

image::oadp-backup-failing-alert.png[OADP backup failing alert]


image::oadp-backup-failing-alert.png[OADP backup failing alert]