[Feature] Retry full backups to avoid false positives in alerts #258
Description
Feature (What you would like to be added):
If a regular full snapshot backup fails, it is retried only at the next interval (typically 24h). We should retry the full snapshot backup within a shorter time frame (say 10m, 15m or 20m).
Motivation (Why is this needed?):
Alerts are configured to fire if the latest full backup is more than 24h old. Such an alert typically resolves on its own at the next interval, when the full backup goes through. So most of the alerts are false positives, which makes it hard to automate the follow-up process using ticketing systems (mandated by audit).
Retrying within a range of 10m-20m might resolve the issue automatically earlier so that alerts fire only if retries also fail.
Approach/Hint to implement the solution (optional):