feat(DR): update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features #304

Merged: 12 commits into feat/release-v1.30.0 on Nov 20, 2024

Conversation

@nutellinoit (Member) commented Nov 16, 2024

This PR introduces several enhancements and new features to the Disaster Recovery (DR) module, including improvements to schedule customization, new options for snapshot data movement, and the ability to optionally install a snapshot-controller. It also includes a breaking schema change to improve usability. Detailed explanations and examples are provided below.

Checklist for Testing 🧪

  • Backup taken with Restic using the previous module version, then restored using the new DR module version with Kopia set as the default uploader.
  • Backup and restore with Kopia post-upgrade.
  • Backup and restore with snapshots using Ceph as the CSI storage, verifying that with deletionPolicy: Retain, snapshots can still be restored even if the namespace is deleted.
  • Backup and restore with snapshotMoveData enabled, verifying that a backup performed on a cluster, after destroying and recreating the cluster, can be restored by pointing Velero to the same object storage.
    • EDIT 1: currently facing this issue on restoring: Fail get DataUploadResult for restore - multiple DataUpload result cms found  vmware-tanzu/velero#7057 (comment)
    • EDIT 2: Only the first backup taken with snapshotMoveData can be restored automatically; the parameter has been marked as experimental.
    • EDIT 3: Data can be restored manually with the Kopia CLI: connect to the repository with kopia repository connect s3 --bucket=velero --prefix=kopia/<namespacename>/ --endpoint="yourendpoint" --access-key=xxxx --secret-access-key=xxxx --region=xxxx --disable-tls (the repository password is static-passw0rd), list the snapshots with kopia snapshot list --all -l, and finally restore with kopia snapshot restore ke7c2317576cfc5336a966ee040d68a3d /path-to-restore-to. NOTE: the kopia CLI must be installed. See the sketch after this checklist.
    • EDIT 4: If you take a manual backup that does not start from the full schedule and then restore it, restores work even for subsequent backups. We may need to update the field description with some caveats (@luigidematteis is trying to reproduce the issue I found, so maybe we can fix it).
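
To make the manual restore path from EDIT 3 easier to follow, here is a minimal sketch of the Kopia workflow; the bucket, prefix, endpoint, credentials, and snapshot ID are placeholders taken from the example above and must be replaced with the values of your environment:

    # Connect to the Kopia repository that Velero created in the object storage
    # (static-passw0rd is the default repository password used by Velero)
    kopia repository connect s3 \
      --bucket=velero \
      --prefix=kopia/<namespacename>/ \
      --endpoint="yourendpoint" \
      --access-key=xxxx \
      --secret-access-key=xxxx \
      --region=xxxx \
      --disable-tls

    # List all snapshots in the repository and note the ID to restore
    kopia snapshot list --all -l

    # Restore the chosen snapshot ID to a local path
    kopia snapshot restore <snapshot-id> /path-to-restore-to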

Details

  • DR improved configurable schedules:
    The schedule configuration has been updated to enhance the usability of schedule customization (note: this is a breaking change):

    ...
      dr:
        velero:
          schedules:
            install: true
            definitions:
              manifests:
                schedule: "*/15 * * * *"
                ttl: "720h0m0s"
              full:
                schedule: "0 1 * * *"
                ttl: "720h0m0s"
                snapshotMoveData: false
    ...
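
    As a quick sanity check after applying the new format, the schedules rendered by the module can be inspected with the Velero CLI (a sketch; pass -n <velero-namespace> if Velero is not installed in the default velero namespace, and replace <schedule-name> with one of the schedules listed):

      # List the Schedule resources and their cron expressions
      velero schedule get

      # Show the details of a single schedule and its backup template
      velero schedule describe <schedule-name>
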
  • DR snapshotMoveData options for full schedule:
    A new parameter has been introduced in the Velero full schedule to enable the snapshotMoveData feature. This feature allows data captured from a snapshot to be copied to the object storage location.
    Important: Enabling this parameter will cause Velero to upload all data from the snapshotted volumes to S3 using Kopia. While backups are deduplicated, significant storage usage is still expected. To enable this parameter in the full schedule:

    ...
      dr:
        velero:
          schedules:
            install: true
            definitions:
              full:
                snapshotMoveData: true
    ...

    General example to enable Volume Snapshotting on rook-ceph (from our storage add-on module):

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: velero-snapclass
      labels:
        velero.io/csi-volumesnapshot-class: "true"
    driver: rook-ceph.rbd.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
    deletionPolicy: Retain

    deletionPolicy: Retain is important because if the volume snapshot is deleted from the namespace, the cluster-wide VolumeSnapshotContent CR will be preserved, maintaining the snapshot on the storage that the cluster is using.
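
    A few commands that can help verify this behaviour (a sketch; it assumes the VolumeSnapshotClass is named velero-snapclass as in the example above and that the snapshot CRDs are installed in the cluster):

      # Confirm the snapshot class exists and carries the Velero label
      kubectl get volumesnapshotclass velero-snapclass -o yaml

      # After a backup with CSI snapshots, the cluster-wide VolumeSnapshotContent
      # objects should remain even if the namespaced VolumeSnapshot is deleted
      kubectl get volumesnapshotcontent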

  • DR optional snapshot-controller installation:
    To leverage VolumeSnapshots on the OnPremises and KFDDistribution providers, a new option on Velero has been added to install the snapshot-controller component.
    Important: Before activating this parameter, make sure that there is no other snapshot-controller component deployed in your cluster. By default, this parameter is false.

    ...
      dr:
        velero:
          snapshotController:
            install: true
    ...
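
    Before enabling the flag, a simple way to check for an existing snapshot-controller (a sketch; deployment names and namespaces vary between distributions):

      # Look for a snapshot-controller deployment anywhere in the cluster
      kubectl get deployments --all-namespaces | grep -i snapshot-controller

      # Existing VolumeSnapshot CRDs can also hint that a controller is already managed elsewhere
      kubectl get crd | grep -i volumesnapshot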

Breaking Changes 💔

  • DR Schema change:
    A new format for the schedule customization has been introduced to improve usability.

@luigidematteis commented Nov 17, 2024

@nutellinoit maybe the problem is due to the fact that we are backing up all of Velero's resources, which does not seem to be recommended:

https://github.com/vmware-tanzu/velero/issues/7057#issuecomment-1794110211

I haven't found any specific directive about this in the Velero docs, actually.

On the other hand, I see that there are some must-have resources needed to successfully restore a backup:

vmware-tanzu/velero#6709

and from what I understand, DataUpload is needed for restoring volumes.

I think it would require further investigation to clearly understand what happened and what we can do.

@nutellinoit (Member, Author) commented:

No, I did some more testing and the issue arises when there is more than one backup for the same volume in Kopia. I added some more edits with the workaround to the issue description.

@nutellinoit (Member, Author) commented:

Update:

If you create a backup without starting from the full schedule, for example with:

velero backup create <backup-name> --snapshot-move-data --include-namespaces <namespace-name>

All restores from these backups, even subsequent ones, work fine. The only issue arises when the backup is created from the full schedule.

@luigidematteis is trying to find the reason behind this.
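
For reference, restoring from such a manual backup would look roughly like this with the standard Velero CLI (a sketch; <backup-name> is the name used in the command above):

# Create a restore from the manual backup, including the moved snapshot data
velero restore create --from-backup <backup-name>

# Follow the restore progress and per-resource results
velero restore describe <restore-name> --details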

@ralgozino changed the title from "Feat: update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features" to "feat(DR): update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features" on Nov 19, 2024
@nutellinoit merged commit 62b9e47 into feat/release-v1.30.0 on Nov 20, 2024 (1 check failed)
@nutellinoit deleted the feat/update-dr-to-v3.0.0-rc branch on November 20, 2024 at 08:59