feat(DR): update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features #304

Merged: 12 commits into feat/release-v1.30.0 on Nov 20, 2024

Conversation

@nutellinoit (Member) commented Nov 16, 2024

This PR introduces several enhancements and new features to the Disaster Recovery (DR) module, including improvements to schedule customization, new options for snapshot data movement, and the ability to optionally install a snapshot-controller. It also includes a breaking schema change to improve usability. Detailed explanations and examples are provided below.

Checklist for Testing 🧪

  • Backup taken with Restic using the previous module version, then restored using the new DR module version with Kopia set as the default uploader.
  • Backup and restore with Kopia post-upgrade.
  • Backup and restore with snapshots using Ceph as the CSI storage, verifying that with deletionPolicy: Retain, snapshots can still be restored even if the namespace is deleted.
  • Backup and restore with snapshotMoveData enabled, verifying that a backup performed on a cluster, after destroying and recreating the cluster, can be restored by pointing Velero to the same object storage.
    • EDIT 1: currently facing this issue on restoring: Fail get DataUploadResult for restore - multiple DataUpload result cms found  vmware-tanzu/velero#7057 (comment)
    • EDIT 2: Only the first backup taken with snapshotMoveData can be restored automatically; the parameter has been marked as experimental.
    • EDIT 3: Data can be restored manually with the Kopia CLI: connect to the repository with kopia repository connect s3 --bucket=velero --prefix=kopia/<namespacename>/ --endpoint="yourendpoint" --access-key=xxxx --secret-access-key=xxxx --region=xxxx --disable-tls (the repository password is static-passw0rd), list the snapshots with kopia snapshot list --all -l, and finally restore with kopia snapshot restore ke7c2317576cfc5336a966ee040d68a3d /path-to-restore-to. NOTE: the kopia CLI must be installed. See the sketch after this checklist.
    • EDIT 4: If you take a manual backup that does not start from the full schedule and then restore it, restores work even for subsequent backups. We may need to update the field description with some caveats (@luigidematteis is trying to reproduce the issue I found, so maybe we can fix it).
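
To make the manual restore path from EDIT 3 easier to follow, here is a minimal sketch of the Kopia workflow; the bucket, prefix, endpoint, credentials, and snapshot ID are placeholders taken from the example above and must be replaced with the values of your environment:

    # Connect to the Kopia repository that Velero created in the object storage
    # (static-passw0rd is the default repository password used by Velero)
    kopia repository connect s3 \
      --bucket=velero \
      --prefix=kopia/<namespacename>/ \
      --endpoint="yourendpoint" \
      --access-key=xxxx \
      --secret-access-key=xxxx \
      --region=xxxx \
      --disable-tls

    # List all snapshots in the repository and note the ID to restore
    kopia snapshot list --all -l

    # Restore the chosen snapshot ID to a local path
    kopia snapshot restore <snapshot-id> /path-to-restore-to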

Details

  • DR improved configurable schedules:
    The schedule configuration has been updated to enhance the usability of schedule customization (note: this is a breaking change):

    ...
      dr:
        velero:
          schedules:
            install: true
            definitions:
              manifests:
                schedule: "*/15 * * * *"
                ttl: "720h0m0s"
              full:
                schedule: "0 1 * * *"
                ttl: "720h0m0s"
                snapshotMoveData: false
    ...
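
    As a quick sanity check after applying the new format, the schedules rendered by the module can be inspected with the Velero CLI (a sketch; pass -n <velero-namespace> if Velero is not installed in the default velero namespace, and replace <schedule-name> with one of the schedules listed):

      # List the Schedule resources and their cron expressions
      velero schedule get

      # Show the details of a single schedule and its backup template
      velero schedule describe <schedule-name>
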
  • DR snapshotMoveData options for full schedule:
    A new parameter has been introduced in the Velero full schedule to enable the snapshotMoveData feature. This feature allows data captured from a snapshot to be copied to the object storage location.
    Important: Enabling this parameter will cause Velero to upload all data from the snapshotted volumes to S3 using Kopia. While backups are deduplicated, significant storage usage is still expected. To enable this parameter in the full schedule:

    ...
      dr:
        velero:
          schedules:
            install: true
            definitions:
              full:
                snapshotMoveData: true
    ...

    General example to enable Volume Snapshotting on rook-ceph (from our storage add-on module):

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: velero-snapclass
      labels:
        velero.io/csi-volumesnapshot-class: "true"
    driver: rook-ceph.rbd.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
    deletionPolicy: Retain

    deletionPolicy: Retain is important because if the volume snapshot is deleted from the namespace, the cluster-wide VolumeSnapshotContent CR will be preserved, maintaining the snapshot on the storage that the cluster is using.
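
    A few commands that can help verify this behaviour (a sketch; it assumes the VolumeSnapshotClass is named velero-snapclass as in the example above and that the snapshot CRDs are installed in the cluster):

      # Confirm the snapshot class exists and carries the Velero label
      kubectl get volumesnapshotclass velero-snapclass -o yaml

      # After a backup with CSI snapshots, the cluster-wide VolumeSnapshotContent
      # objects should remain even if the namespaced VolumeSnapshot is deleted
      kubectl get volumesnapshotcontent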

  • DR optional snapshot-controller installation:
    To leverage VolumeSnapshots on the OnPremises and KFDDistribution providers, a new option on Velero has been added to install the snapshot-controller component.
    Important: Before activating this parameter, make sure that there is no other snapshot-controller component deployed in your cluster. By default, this parameter is false.

    ...
      dr:
        velero:
          snapshotController:
            install: true
    ...
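
    Before enabling the flag, a simple way to check for an existing snapshot-controller (a sketch; deployment names and namespaces vary between distributions):

      # Look for a snapshot-controller deployment anywhere in the cluster
      kubectl get deployments --all-namespaces | grep -i snapshot-controller

      # Existing VolumeSnapshot CRDs can also hint that a controller is already managed elsewhere
      kubectl get crd | grep -i volumesnapshot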

Breaking Changes 💔

  • DR Schema change:
    A new format for the schedule customization has been introduced to improve usability.

@luigidematteis commented Nov 17, 2024

@nutellinoit maybe the problem is due to the fact that we are backing up all of Velero's resources, which does not seem to be recommended:

https://github.com/vmware-tanzu/velero/issues/7057#issuecomment-1794110211

I haven't found any specific directive about this in the Velero docs, actually.

On the other hand, I see that there are some must-have resources needed to successfully restore a backup:

vmware-tanzu/velero#6709

and from what I understand, DataUpload is needed for restoring volumes.

I think it would require further investigation to clearly understand what happened and what we can do.

@nutellinoit (Member, Author) commented:

No, I did some more testing and the issue arises when there is more than one backup for the same volume in Kopia. I added some more edits with the workaround to the issue description.

@nutellinoit (Member, Author) commented:

Update:

If you create a backup without starting from the full schedule, for example with:

velero backup create <backup-name> --snapshot-move-data --include-namespaces <namespace-name>

All restores from these backups, even subsequent ones, work fine. The only issue arises when the backup is created from the full schedule.

@luigidematteis is trying to find the reason behind this.
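
For reference, restoring from such a manual backup would look roughly like this with the standard Velero CLI (a sketch; <backup-name> is the name used in the command above):

# Create a restore from the manual backup, including the moved snapshot data
velero restore create --from-backup <backup-name>

# Follow the restore progress and per-resource results
velero restore describe <restore-name> --details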

@ralgozino changed the title from "Feat: update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features" to "feat(DR): update dr to v3.0.0 rc, refactor schedules, add new snapshotController and snapshotMoveData features" on Nov 19, 2024
@nutellinoit merged commit 62b9e47 into feat/release-v1.30.0 on Nov 20, 2024 (1 check failed)
@nutellinoit deleted the feat/update-dr-to-v3.0.0-rc branch on November 20, 2024 at 08:59