Skip to content

Latest commit

 

History

History
116 lines (78 loc) · 7.68 KB

restic-backup-and-restore-progress.md

File metadata and controls

116 lines (78 loc) · 7.68 KB

Progress reporting for restic backups and restores

Status: Accepted

During long-running restic backups/restores, there is no visibility into what (if anything) is happening, making it hard to know if the backup/restore is making progress or hung, how long the operation might take, etc. We should capture progress during restic operations and make it user-visible so that it's easier to reason about. This document proposes an approach for capturing progress of backup and restore operations and exposing this information to users.

Goals

  • Provide basic visibility into restic operations to inform users about their progress.

Non Goals

  • Capturing progress for non-restic backups and restores.

Background

(Omitted, see introduction)

High-Level Design

restic backup progress

The restic backup command provides progress reporting to stdout in JSON format, which includes the completion percentage of the backup. This progress will be read on some interval and the PodVolumeBackup Custom Resource's (CR) status will be updated with this information.

restic restore progress

The restic stats command returns the total size of a backup. This can be compared with the total size the volume periodically to calculate the completion percentage of the restore. The PodVolumeRestore CR's status will be updated with this information.

Detailed Design

Changes to PodVolumeBackup and PodVolumeRestore Status type

A new Progress field will be added to PodVolumeBackupStatus and PodVolumeRestoreStatus of type PodVolumeOperationProgress:

type PodVolumeOperationProgress struct {
  TotalBytes int64
  BytesDone int64
}

restic backup progress

restic added support for streaming JSON output for the restic backup command in 0.9.5. Our current images ship restic 0.9.4, and so the Dockerfile will be updated to pull the new version: https://github.com/heptio/velero/blob/af4b9373fc73047f843cd4bc3648603d780c8b74/Dockerfile-velero#L21. With the --json flag, restic backup outputs single lines of JSON reporting the status of the backup:

{"message_type":"status","percent_done":0,"total_files":1,"total_bytes":21424504832}
{"message_type":"status","action":"scan_finished","item":"","duration":0.219241873,"data_size":49461329920,"metadata_size":0,"total_files":10}
{"message_type":"status","percent_done":0,"total_files":10,"total_bytes":49461329920,"current_files":["/file3"]}
{"message_type":"status","percent_done":0.0003815984736061056,"total_files":10,"total_bytes":49461329920,"bytes_done":18874368,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0011765952936188255,"total_files":10,"total_bytes":49461329920,"bytes_done":58195968,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0019503921984312064,"total_files":10,"total_bytes":49461329920,"bytes_done":96468992,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0028089887640449437,"total_files":10,"total_bytes":49461329920,"bytes_done":138936320,"current_files":["/file1","/file3"]}

The command factory for backup will be updated to include the --json flag. The code to run the restic backup command (https://github.com/heptio/velero/blob/af4b9373fc73047f843cd4bc3648603d780c8b74/pkg/controller/pod_volume_backup_controller.go#L241) will be changed to include a Goroutine that reads from the command's stdout stream. The implementation of this will largely follow @jmontleon's PoC of this. The Goroutine will periodically read the stream (every 10 seconds) and get the last printed status line, which will be convered to JSON. If bytes_done is empty, restic has not finished scanning the volume and hasn't calculated the total_bytes. In this case, we will not update the PodVolumeBackup and instead will wait for the next iteration. Once we get a non-zero value for bytes_done, the bytes_done and total_bytes properties will be read and the PodVolumeBackup will be patched to update status.Progress.BytesDone and status.Progress.TotalBytes respectively.

Once the backup has completed successfully, the PodVolumeBackup will be patched to set status.Progress.BytesDone = status.Progress.TotalBytes. This is done since the main thread may cause early termination of the Goroutine once the operation has finished, preventing a final update to the BytesDone property.

restic restore progress

The restic stats <snapshot_id> --json command provides information about the size of backups:

{"total_size":10558111744,"total_file_count":11}

Before beginning the restore operation, we can use the output of restic stats to get the total size of the backup. The PodVolumeRestore will be patched to set status.Progress.TotalBytes to the total size of the backup.

The code to run the restic restore command will be changed to include a Goroutine that periodically (every 10 seconds) gets the current size of the volume. To get the current size of the volume, we will recursively walkthrough all files in the volume to accumulate the total size. The current total size is the number of bytes transferred so far and the PodVolumeRestore will be patched to update status.Progress.BytesDone.

Once the restore has completed successfully, the PodVolumeRestore will be patched to set status.Progress.BytesDone = status.Progress.TotalBytes. This is done since the main thread may cause early termination of the Goroutine once the operation has finished, preventing a final update to the BytesDone property.

Velero CLI changes

The output that describes detailed information about PodVolumeBackups and PodVolumeRestores will be updated to calculate and display a completion percentage from status.Progress.TotalBytes and status.Progress.BytesDone if available.

Open Questions

  • Can we assume that the volume we are restoring in will be empty? Can it contain other artefacts?
    • Based on discussion in this PR, we are okay making the assumption that the PVC is empty and will proceed with the above proposed approach.

Alternatives Considered

restic restore progress

If we cannot assume that the volume we are restoring into will be empty, we can instead use the output from restic snapshot to get the list of files in the backup. This can then be used to calculate the current total size of just those files in the volume, so that we avoid considering any other files unrelated to the backup. The above proposed approach is simpler than this one, as we don't need to keep track of each file in the backup, but this will be more robust if the volume could contain other files not included in the backup. It's possible that certain volume types may contain hidden files that could attribute to the total size of the volume, though these should be small enough that the BytesDone calculation will only be slightly inflated.

Another option is to contribute progress reporting similar to restic backup for restic restore upstream. This may take more time, but would give us a more native view on the progress of a restore. There are several issues about this already in the restic repo (restic/restic#426, restic/restic#1154), and what looks like an abandoned attempt (restic/restic#2003) which we may be able to pick up.

Security Considerations

N/A