Velero restore stuck during Kopia restoration #7901
I'm attaching the complete logs from after the timeout; the restore did not restore all of the disks' content and finished with status PartiallyFailed.
Looks like there are 17 pods, and the restores for these pods are processed in sequence as PodVolumeRestores (PVRs). The default timeout for all of these pods together is 4 hours, and from the log the timeout was hit.
Therefore, you can increase the timeout (see the sketch below).
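For reference, a minimal sketch of raising that timeout, assuming the relevant setting is the Velero server's `--fs-backup-timeout` flag (default 4h) and that Velero runs in the default `velero` namespace:

```sh
# Hedged example: raise the file-system backup/restore timeout from the
# default 4h to 8h by appending the server flag to the Velero deployment.
# Namespace and flag name are assumptions based on a default install.
kubectl -n velero patch deployment velero --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--fs-backup-timeout=8h"}
]'
```

Depending on the Velero version, an equivalent install-time flag may also exist; check `velero install --help` before editing the deployment by hand.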
Hi, I don't think the timeout is the cause of the issue; it's a consequence of something blocking the process. After about a dozen minutes it was stuck in this state, and nothing changed during the next 3 hours.
Do you know what these pods are (oomie-* and fluent-bit-*), and is there any other controller operating on the pods besides Velero? It seems that these pods disappeared after the Velero restore created them.
The oomie and fluent-bit pods are DaemonSet pods, one pod per node. We use the cluster autoscaler in our clusters.
I think that is the cause. Details: below are some of the incomplete PVRs from the attached Velero bundle:
As a best practice, you should keep the environment static while a Velero restore is running.
Thanks a lot for these hints. I'll look into whether I can exclude the DaemonSet pods from the backup, and keep you posted.
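A minimal sketch of that exclusion, assuming the DaemonSet pods hold no data worth backing up and that Velero's `velero.io/exclude-from-backup=true` label is the mechanism used (DaemonSet names and namespaces below are illustrative):

```sh
# Hedged example: add the exclude label to the DaemonSet pod templates so the
# pods themselves are skipped by Velero backups. Names/namespaces are assumed.
kubectl -n logging patch daemonset fluent-bit --type=merge -p \
  '{"spec":{"template":{"metadata":{"labels":{"velero.io/exclude-from-backup":"true"}}}}}'
kubectl -n kube-system patch daemonset oomie --type=merge -p \
  '{"spec":{"template":{"metadata":{"labels":{"velero.io/exclude-from-backup":"true"}}}}}'
```

Note that patching the pod template triggers a rolling restart of the DaemonSet pods; a label selector or resource filter on the Backup itself would achieve a similar result without touching the workloads.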
This problem happens when the pod is deleted before the PVR gets to InProgress status; if the pod is deleted after the PVB/PVR gets to InProgress status, the data restore fails because it cannot access the volume, so the PVR fails directly. During restore, if the pod is not found, at present we don't treat it as an error but retry getting it forever. We may need to change this behavior, e.g., add something like a max retry count.
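To confirm that race from the outside, a hedged diagnostic sketch: list the PodVolumeRestores of a restore and check whether the pod each one is waiting for still exists. It assumes Velero's default `velero` namespace, the `velero.io/restore-name` label on PVRs, and that the target pod is recorded under `spec.pod`:

```sh
# Hedged diagnostic: for each PodVolumeRestore of a restore, print its phase
# and flag the ones whose target pod no longer exists.
RESTORE=my-restore   # hypothetical restore name
kubectl -n velero get podvolumerestores -l velero.io/restore-name="$RESTORE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.pod.namespace}{"/"}{.spec.pod.name}{"\n"}{end}' |
while IFS=$'\t' read -r pvr phase pod; do
  ns=${pod%%/*}; name=${pod##*/}
  kubectl -n "$ns" get pod "$name" >/dev/null 2>&1 \
    || echo "$pvr ($phase): target pod $pod is gone"
done
```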
It's a valid enhancement to surface a warning or failure when such a race condition happens, rather than staying stuck until the timeout. That said, this is a relatively rare corner case, so let me keep it in the backlog; we can revisit it if we have extra bandwidth during v1.15 development.
This issue is stale because it has been open for 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.
I am experiencing the same issue when trying to restore Vault. Vault is running, but the restore is stuck in progress.
Also note that I set ...
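If someone else hits this, a quick hedged way to see where such a restore is hanging (assuming the default `velero` namespace):

```sh
# Show per-volume progress of a stuck restore, then the raw PVR objects.
velero restore describe <restore-name> --details
kubectl -n velero get podvolumerestores -l velero.io/restore-name=<restore-name>
```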
What steps did you take and what happened:
I took a backup with only FSB (file system backup); it worked fine.
I tried to restore it on another cluster; the restore got stuck at some point.
I'm using AWS EKS 1.29.
What did you expect to happen:
Restore to complete
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
bundle-2024-06-17-13-28-28.tar.gz
Anything else you would like to add:
Environment:
- Velero version (use `velero version`): 1.13.2
- Velero features (use `velero client config get features`): features:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.