Velero restore stuck during Kopia restoration #7901

Open
agateaux opened this issue Jun 17, 2024 · 12 comments

Comments

@agateaux

agateaux commented Jun 17, 2024

What steps did you take and what happened:

I took a backup using only FSB (file system backup); the backup completed fine.
I then tried to restore it on another cluster: the restore gets stuck at some point.

I'm using AWS EKS 1.29

What did you expect to happen:
Restore to complete

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
bundle-2024-06-17-13-28-28.tar.gz

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.13.2
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version):
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-eks-49c6de4
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@agateaux
Author

I'm attaching the complete logs captured after the timeout; the restore did not restore all of the disks' content and ended with status PartiallyFailed.
bundle-2024-06-17-15-09-45.tar.gz

@Lyndon-Li
Contributor

Looks like there are 17 pods, and restores are processed sequentially across pods by the PVR controller. The default timeout for all of these pods together is 4 hours. From the log, the timeout was hit:

time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois

Therefore, you can either increase the timeout value (set --pod-volume-operation-timeout during install) or switch to CSI snapshot data movement backup, which restores volumes in parallel.
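As a rough sketch (treat these as illustrative rather than authoritative; flag availability depends on your Velero version, and the backup name is made up):

$ velero install ... --pod-volume-operation-timeout 8h          # raise the pod volume operation timeout at install time

$ velero backup create my-backup --snapshot-move-data           # or back up with CSI snapshot data movement instead of FSB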

@agateaux
Copy link
Author

Hi,
Thanks for the quick answer.

I don't think the timeout is the cause of the issue; it's a consequence of something blocking the process.

After a dozen minutes or so, it was stuck in this state, and nothing changed over the next 3 hours.

@Lyndon-Li
Contributor

Do you know what these pods (oomie-* and fluent-bit-*) are, and is there any other controller operating on them besides Velero? It seems that these pods disappeared after the Velero restore created them.

time="2024-06-17T15:12:03Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-qtshz\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-wp4tr\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-2jjt9\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-fdwm9\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-4l6rw\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-llkhj\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-fpdvq\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:03Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-6tcmd\" not found" logSource="pkg/podvolume/restorer.go:212"

@agateaux
Author

oomie and fluent-bit pods are DaemonSet pods; there is one pod per node.

We use the cluster autoscaler in our clusters.
The original cluster has 4 nodes running, while the new cluster has only 2 nodes until the next scale-up trigger.
That's why these pods got removed automatically.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 20, 2024

I think that is the cause. Details:
When Velero restore creates the pod, it adds an init container to detect the pod's status; once the pod is fully restored and in Running status, the init container notifies Velero to start restoring the data. When the data restore completes, Velero marks the PVR as completed.
However, since the pod was deleted by a third party, the PVR never completes.

Below are some of the incomplete PVRs from the attached Velero bundle:

            "spec": {
                "backupStorageLocation": "default",
                "pod": {
                    "kind": "Pod",
                    "name": "oomie-qtshz",
                    "namespace": "plotly-system",
                    "uid": "9d1c9527-4976-47ab-adf3-5baf002e78b1"
                },
                "repoIdentifier": "",
                "snapshotID": "7922513112356b90cc92d3ae4b8e84bb",
                "sourceNamespace": "plotly-system",
                "uploaderSettings": {
                    "WriteSparseFiles": "false"
                },
                "uploaderType": "kopia",
                "volume": "istio-envoy"
            },
            "status": {
                "progress": {}
            }
            "spec": {
                "backupStorageLocation": "default",
                "pod": {
                    "kind": "Pod",
                    "name": "fluent-bit-llkhj",
                    "namespace": "plotly-system",
                    "uid": "1d01876d-df63-4e63-97dc-6e028ad9c48c"
                },
                "repoIdentifier": "",
                "snapshotID": "b1491cf0cb992866d426b6ed210377e9",
                "sourceNamespace": "plotly-system",
                "uploaderSettings": {
                    "WriteSparseFiles": "false"
                },
                "uploaderType": "kopia",
                "volume": "istio-envoy"
            },
            "status": {
                "progress": {}
            }
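For anyone reproducing this, a quick way to see these PVRs on a live cluster (a hedged sketch assuming the standard velero namespace and the velero.io/restore-name label that Velero puts on PVR objects; the restore name "trois" is taken from the logs above) is:

$ kubectl -n velero get podvolumerestores -l velero.io/restore-name=trois -o wide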

@Lyndon-Li
Contributor

As a best practice, you should keep the environment static while a Velero restore is running.

@agateaux
Author

Thanks a lot for these hints.

I'll look into whether I can exclude the DaemonSet pods from the backup, and keep you posted.
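For reference, a hedged sketch of two ways to do that (the DaemonSet name oomie and namespace plotly-system are assumed from the PVR specs above and may differ; the same would apply to fluent-bit):

# Exclude the pods entirely from backups via Velero's exclude label
$ kubectl -n plotly-system patch daemonset oomie --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"velero.io/exclude-from-backup":"true"}}}}}'

# Or keep the pods in the backup but skip the volume for FSB via the opt-out annotation
$ kubectl -n plotly-system patch daemonset oomie --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"istio-envoy"}}}}}'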

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 21, 2024

This problem happens when the pod is deleted before the PVR reaches InProgress status; if the pod is deleted after the PVB/PVR reaches InProgress, the data restore will fail because it cannot access the volume, so the PVR fails directly.

During restore, if the pod is not found, we currently don't treat it as an error but keep retrying the get forever. We may need to change this behavior, e.g., add something like a max retry count.
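As a rough illustration only (not Velero's actual code; the function and parameter names here are made up), a bounded wait using client-go and the apimachinery wait helpers could look like this:

package podwait

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning polls for the restored pod but gives up after maxWait
// instead of retrying forever, so the PVR could be failed with a clear error.
func waitForPodRunning(ctx context.Context, client kubernetes.Interface, ns, name string, maxWait time.Duration) (*corev1.Pod, error) {
	var pod *corev1.Pod
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, maxWait, true, func(ctx context.Context) (bool, error) {
		p, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The pod may have been deleted by a third party (e.g. the autoscaler);
			// keep polling until maxWait, then give up instead of looping forever.
			return false, nil
		}
		if err != nil {
			return false, err
		}
		pod = p
		return p.Status.Phase == corev1.PodRunning, nil
	})
	if err != nil {
		return nil, fmt.Errorf("pod %s/%s did not reach Running within %s: %w", ns, name, maxWait, err)
	}
	return pod, nil
}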

@reasonerjt
Contributor

It's a valid enhancement to trigger a warning or failure when such a race condition happens, rather than getting stuck until the timeout.

With that said, this is a relatively rare corner case, so let me keep it in the backlog; we can revisit it if we have extra bandwidth during v1.15 development.


This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.

@mvalahtv

mvalahtv commented Dec 2, 2024

I am experiencing the same when trying to restore Vault. Vault is running, but the restore stays InProgress and hangs forever.

$ velero restore describe --details restore-vault
Name:         restore-vault
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                                 InProgress
Estimated total items to be restored:  59
Items restored so far:                 59

Started:    2024-12-02 16:18:21 +0100 CET
Completed:  <n/a>

Backup:  backup-vault-20

Namespaces:
  Included:  vault
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

kopia Restores:
  Completed:
    vault/vault-hashicorp-0: audit, data, home
    vault/vault-hashicorp-1: audit, data, home
    vault/vault-hashicorp-2: audit, data, home
  New:
    vault/vault-hashicorp-csi-provider-5kgmx: agent-unix-socket
    vault/vault-hashicorp-csi-provider-hv2ms: agent-unix-socket
    vault/vault-hashicorp-csi-provider-s88kd: agent-unix-socket
    vault/vault-hashicorp-csi-provider-t2mvx: agent-unix-socket
    vault/vault-hashicorp-csi-provider-wh55m: agent-unix-socket

Existing Resource Policy:   <none>
ItemOperationTimeout:       5m0s

Preserve Service NodePorts:  auto

Uploader config:


Resource List:  <restore resource list not found>

Also note that I set ItemOperationTimeout: 5m0s and the restore still did not finish. I could clean up the restore only by running velero uninstall. The vault/vault-hashicorp-csi-provider-5kgmx: agent-unix-socket volume is simply an emptyDir. Very strange.
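If anyone hits this, a quick check worth doing (hedged: the vault-hashicorp-csi-provider-* pods are DaemonSet pods, so the same pod-replacement race discussed above is a likely suspect; the label and field paths are assumed from Velero's PVR conventions) is whether the pods backing the "New" PVRs still exist:

# Map each PVR for this restore to its target pod and phase
$ kubectl -n velero get podvolumerestores -l velero.io/restore-name=restore-vault \
    -o custom-columns=NAME:.metadata.name,POD:.spec.pod.name,PHASE:.status.phase

# Then confirm those exact pods still exist in the vault namespace
$ kubectl -n vault get pods | grep csi-provider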
