Status: Accepted
During long-running restic backups/restores, there is no visibility into what (if anything) is happening, making it hard to know if the backup/restore is making progress or hung, how long the operation might take, etc. We should capture progress during restic operations and make it user-visible so that it's easier to reason about. This document proposes an approach for capturing progress of backup and restore operations and exposing this information to users.
- Provide basic visibility into restic operations to inform users about their progress.
- Capturing progress for non-restic backups and restores.
(Omitted, see introduction)
The restic backup
command provides progress reporting to stdout in JSON format, which includes the completion percentage of the backup.
This progress will be read on some interval and the PodVolumeBackup Custom Resource's (CR) status will be updated with this information.
The restic stats
command returns the total size of a backup.
This can be compared with the total size the volume periodically to calculate the completion percentage of the restore.
The PodVolumeRestore CR's status will be updated with this information.
A new Progress
field will be added to PodVolumeBackupStatus and PodVolumeRestoreStatus of type PodVolumeOperationProgress
:
type PodVolumeOperationProgress struct {
TotalBytes int64
BytesDone int64
}
restic added support for streaming JSON output for the restic backup
command in 0.9.5.
Our current images ship restic 0.9.4, and so the Dockerfile will be updated to pull the new version: https://github.com/heptio/velero/blob/af4b9373fc73047f843cd4bc3648603d780c8b74/Dockerfile-velero#L21.
With the --json
flag, restic backup
outputs single lines of JSON reporting the status of the backup:
{"message_type":"status","percent_done":0,"total_files":1,"total_bytes":21424504832}
{"message_type":"status","action":"scan_finished","item":"","duration":0.219241873,"data_size":49461329920,"metadata_size":0,"total_files":10}
{"message_type":"status","percent_done":0,"total_files":10,"total_bytes":49461329920,"current_files":["/file3"]}
{"message_type":"status","percent_done":0.0003815984736061056,"total_files":10,"total_bytes":49461329920,"bytes_done":18874368,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0011765952936188255,"total_files":10,"total_bytes":49461329920,"bytes_done":58195968,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0019503921984312064,"total_files":10,"total_bytes":49461329920,"bytes_done":96468992,"current_files":["/file1","/file3"]}
{"message_type":"status","percent_done":0.0028089887640449437,"total_files":10,"total_bytes":49461329920,"bytes_done":138936320,"current_files":["/file1","/file3"]}
The command factory for backup will be updated to include the --json
flag.
The code to run the restic backup
command (https://github.com/heptio/velero/blob/af4b9373fc73047f843cd4bc3648603d780c8b74/pkg/controller/pod_volume_backup_controller.go#L241) will be changed to include a Goroutine that reads from the command's stdout stream.
The implementation of this will largely follow @jmontleon's PoC of this.
The Goroutine will periodically read the stream (every 10 seconds) and get the last printed status line, which will be convered to JSON.
If bytes_done
is empty, restic has not finished scanning the volume and hasn't calculated the total_bytes
.
In this case, we will not update the PodVolumeBackup and instead will wait for the next iteration.
Once we get a non-zero value for bytes_done
, the bytes_done
and total_bytes
properties will be read and the PodVolumeBackup will be patched to update status.Progress.BytesDone
and status.Progress.TotalBytes
respectively.
Once the backup has completed successfully, the PodVolumeBackup will be patched to set status.Progress.BytesDone = status.Progress.TotalBytes
.
This is done since the main thread may cause early termination of the Goroutine once the operation has finished, preventing a final update to the BytesDone
property.
The restic stats <snapshot_id> --json
command provides information about the size of backups:
{"total_size":10558111744,"total_file_count":11}
Before beginning the restore operation, we can use the output of restic stats
to get the total size of the backup.
The PodVolumeRestore will be patched to set status.Progress.TotalBytes
to the total size of the backup.
The code to run the restic restore
command will be changed to include a Goroutine that periodically (every 10 seconds) gets the current size of the volume.
To get the current size of the volume, we will recursively walkthrough all files in the volume to accumulate the total size.
The current total size is the number of bytes transferred so far and the PodVolumeRestore will be patched to update status.Progress.BytesDone
.
Once the restore has completed successfully, the PodVolumeRestore will be patched to set status.Progress.BytesDone = status.Progress.TotalBytes
.
This is done since the main thread may cause early termination of the Goroutine once the operation has finished, preventing a final update to the BytesDone
property.
The output that describes detailed information about PodVolumeBackups and PodVolumeRestores will be updated to calculate and display a completion percentage from status.Progress.TotalBytes
and status.Progress.BytesDone
if available.
- Can we assume that the volume we are restoring in will be empty? Can it contain other artefacts?
- Based on discussion in this PR, we are okay making the assumption that the PVC is empty and will proceed with the above proposed approach.
If we cannot assume that the volume we are restoring into will be empty, we can instead use the output from restic snapshot
to get the list of files in the backup.
This can then be used to calculate the current total size of just those files in the volume, so that we avoid considering any other files unrelated to the backup.
The above proposed approach is simpler than this one, as we don't need to keep track of each file in the backup, but this will be more robust if the volume could contain other files not included in the backup.
It's possible that certain volume types may contain hidden files that could attribute to the total size of the volume, though these should be small enough that the BytesDone calculation will only be slightly inflated.
Another option is to contribute progress reporting similar to restic backup
for restic restore
upstream.
This may take more time, but would give us a more native view on the progress of a restore.
There are several issues about this already in the restic repo (restic/restic#426, restic/restic#1154), and what looks like an abandoned attempt (restic/restic#2003) which we may be able to pick up.
N/A