Velero restore stuck during Kopia restoration #7901

Open
agateaux opened this issue Jun 17, 2024 · 12 comments

Comments

@agateaux

agateaux commented Jun 17, 2024

What steps did you take and what happened:

I took a backup using only FSB (file system backup); the backup completed fine.
I then tried to restore it on another cluster: the restore gets stuck at some point.

I'm using AWS EKS 1.29

What did you expect to happen:
Restore to complete

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
bundle-2024-06-17-13-28-28.tar.gz

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.13.2
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version):
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-eks-49c6de4
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@agateaux
Author

I'm attaching the complete logs captured after the timeout; the restore did not restore all of the disks' content and ended with status PartiallyFailed.
bundle-2024-06-17-15-09-45.tar.gz

@Lyndon-Li
Contributor

Looks like there are 17 pods, and restores are processed sequentially across pods by the PVR controller. The default timeout for all of these pods together is 4 hours. From the log, the timeout was hit:

time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois
time="2024-06-17T19:08:13Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/trois

Therefore, you can either increase the timeout value (set --pod-volume-operation-timeout during install) or switch to CSI snapshot data movement backup, which restores volumes in parallel.
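As a rough sketch (treat these as illustrative rather than authoritative; flag availability depends on your Velero version, and the backup name is made up):

$ velero install ... --pod-volume-operation-timeout 8h          # raise the pod volume operation timeout at install time

$ velero backup create my-backup --snapshot-move-data           # or back up with CSI snapshot data movement instead of FSB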

@agateaux
Copy link
Author

Hi,
Thanks for the quick answer.

I don't think the timeout is the cause of the issue; it's a consequence of something blocking the process.

After a dozen minutes or so, it was stuck in this state, and nothing changed over the next 3 hours.

@Lyndon-Li
Contributor

Do you know what these pods (oomie-* and fluent-bit-*) are, and is there any other controller operating on them besides Velero? It seems that these pods disappeared after the Velero restore created them.

time="2024-06-17T15:12:03Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-qtshz\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-wp4tr\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-2jjt9\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:04Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"oomie-fdwm9\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-4l6rw\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-llkhj\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:02Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-fpdvq\" not found" logSource="pkg/podvolume/restorer.go:212"
time="2024-06-17T15:12:03Z" level=error msg="Failed to check node-agent pod status, disengage" error="pods \"fluent-bit-6tcmd\" not found" logSource="pkg/podvolume/restorer.go:212"

@agateaux
Author

oomie and fluent-bit pods are DaemonSet pods; there is one pod per node.

We use the cluster autoscaler in our clusters.
The original cluster has 4 nodes running, while the new cluster has only 2 nodes until the next scale-up trigger.
That's why these pods got removed automatically.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 20, 2024

I think that is the cause. Details:
When Velero restore creates the pod, it adds an init container to detect the pod's status; once the pod is fully restored and in Running status, the init container notifies Velero to start restoring the data. When the data restore completes, Velero marks the PVR as completed.
However, since the pod was deleted by a third party, the PVR never completes.

Below are some of the incomplete PVRs from the attached Velero bundle:

            "spec": {
                "backupStorageLocation": "default",
                "pod": {
                    "kind": "Pod",
                    "name": "oomie-qtshz",
                    "namespace": "plotly-system",
                    "uid": "9d1c9527-4976-47ab-adf3-5baf002e78b1"
                },
                "repoIdentifier": "",
                "snapshotID": "7922513112356b90cc92d3ae4b8e84bb",
                "sourceNamespace": "plotly-system",
                "uploaderSettings": {
                    "WriteSparseFiles": "false"
                },
                "uploaderType": "kopia",
                "volume": "istio-envoy"
            },
            "status": {
                "progress": {}
            }
            "spec": {
                "backupStorageLocation": "default",
                "pod": {
                    "kind": "Pod",
                    "name": "fluent-bit-llkhj",
                    "namespace": "plotly-system",
                    "uid": "1d01876d-df63-4e63-97dc-6e028ad9c48c"
                },
                "repoIdentifier": "",
                "snapshotID": "b1491cf0cb992866d426b6ed210377e9",
                "sourceNamespace": "plotly-system",
                "uploaderSettings": {
                    "WriteSparseFiles": "false"
                },
                "uploaderType": "kopia",
                "volume": "istio-envoy"
            },
            "status": {
                "progress": {}
            }
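For anyone reproducing this, a quick way to see these PVRs on a live cluster (a hedged sketch assuming the standard velero namespace and the velero.io/restore-name label that Velero puts on PVR objects; the restore name "trois" is taken from the logs above) is:

$ kubectl -n velero get podvolumerestores -l velero.io/restore-name=trois -o wide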

@Lyndon-Li
Contributor

As a best practice, you should keep the environment static while a Velero restore is running.

@agateaux
Author

Thanks a lot for these hints.

I'll look into whether I can exclude the DaemonSet pods from the backup, and keep you posted.
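For reference, a hedged sketch of two ways to do that (the DaemonSet name oomie and namespace plotly-system are assumed from the PVR specs above and may differ; the same would apply to fluent-bit):

# Exclude the pods entirely from backups via Velero's exclude label
$ kubectl -n plotly-system patch daemonset oomie --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"velero.io/exclude-from-backup":"true"}}}}}'

# Or keep the pods in the backup but skip the volume for FSB via the opt-out annotation
$ kubectl -n plotly-system patch daemonset oomie --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"istio-envoy"}}}}}'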

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 21, 2024

This problem happens when the pod is deleted before the PVR reaches InProgress status; if the pod is deleted after the PVB/PVR reaches InProgress, the data restore will fail because it cannot access the volume, so the PVR fails directly.

During restore, if the pod is not found, we currently don't treat it as an error but keep retrying the get forever. We may need to change this behavior, e.g., add something like a max retry count.
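As a rough illustration only (not Velero's actual code; the function and parameter names here are made up), a bounded wait using client-go and the apimachinery wait helpers could look like this:

package podwait

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning polls for the restored pod but gives up after maxWait
// instead of retrying forever, so the PVR could be failed with a clear error.
func waitForPodRunning(ctx context.Context, client kubernetes.Interface, ns, name string, maxWait time.Duration) (*corev1.Pod, error) {
	var pod *corev1.Pod
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, maxWait, true, func(ctx context.Context) (bool, error) {
		p, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The pod may have been deleted by a third party (e.g. the autoscaler);
			// keep polling until maxWait, then give up instead of looping forever.
			return false, nil
		}
		if err != nil {
			return false, err
		}
		pod = p
		return p.Status.Phase == corev1.PodRunning, nil
	})
	if err != nil {
		return nil, fmt.Errorf("pod %s/%s did not reach Running within %s: %w", ns, name, maxWait, err)
	}
	return pod, nil
}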

@reasonerjt
Contributor

It's a valid enhancement to trigger a warning or failure when such a race condition happens, rather than getting stuck until the timeout.

With that said, this is a relatively rare corner case, so let me keep it in the backlog; we can revisit it if we have extra bandwidth during v1.15 development.


This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.

@mvalahtv

mvalahtv commented Dec 2, 2024

I am experiencing the same when trying to restore Vault. Vault is running, but the restore stays InProgress and hangs forever.

$ velero restore describe --details restore-vault
Name:         restore-vault
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                                 InProgress
Estimated total items to be restored:  59
Items restored so far:                 59

Started:    2024-12-02 16:18:21 +0100 CET
Completed:  <n/a>

Backup:  backup-vault-20

Namespaces:
  Included:  vault
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

kopia Restores:
  Completed:
    vault/vault-hashicorp-0: audit, data, home
    vault/vault-hashicorp-1: audit, data, home
    vault/vault-hashicorp-2: audit, data, home
  New:
    vault/vault-hashicorp-csi-provider-5kgmx: agent-unix-socket
    vault/vault-hashicorp-csi-provider-hv2ms: agent-unix-socket
    vault/vault-hashicorp-csi-provider-s88kd: agent-unix-socket
    vault/vault-hashicorp-csi-provider-t2mvx: agent-unix-socket
    vault/vault-hashicorp-csi-provider-wh55m: agent-unix-socket

Existing Resource Policy:   <none>
ItemOperationTimeout:       5m0s

Preserve Service NodePorts:  auto

Uploader config:


Resource List:  <restore resource list not found>

Also note that I set ItemOperationTimeout: 5m0s and the restore still did not finish. I could clean up the restore only by running velero uninstall. The vault/vault-hashicorp-csi-provider-5kgmx: agent-unix-socket volume is simply an emptyDir. Very strange.
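If anyone hits this, a quick check worth doing (hedged: the vault-hashicorp-csi-provider-* pods are DaemonSet pods, so the same pod-replacement race discussed above is a likely suspect; the label and field paths are assumed from Velero's PVR conventions) is whether the pods backing the "New" PVRs still exist:

# Map each PVR for this restore to its target pod and phase
$ kubectl -n velero get podvolumerestores -l velero.io/restore-name=restore-vault \
    -o custom-columns=NAME:.metadata.name,POD:.spec.pod.name,PHASE:.status.phase

# Then confirm those exact pods still exist in the vault namespace
$ kubectl -n vault get pods | grep csi-provider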
