Add support for managing a fixed number of retained snapshots #3942

pcholakov · 2025-11-03T12:53:49Z

With this change, Restate can be configured to retain a set of recent snapshots and automatically delete older snapshots that are no longer needed. Previously, older snapshots were not managed by Restate, and users were expected to figure out how to clean them up safely. This change saves users from having to implement an external lifecycle policy that respects the latest snapshot necessary for bootstrap.

When explicit snapshot retention is specified, the reported Archived LSN will be that of the earliest retained snapshot. Together with the durability setting, this influences the automatic log trim behavior. When auto trim respects the Archived LSN, any retained snapshot can be used to bootstrap a partition. For now, falling back to an earlier snapshot (e.g. to deal with corruption) requires manually updating the partition's latest.json file to point to the earlier snapshot id.
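
As a rough illustration of the relationship described above (not the code in this PR), the sketch below derives the archived LSN from a set of retained snapshot LSNs and checks that every retained snapshot remains usable when the trim point does not exceed it. The `Lsn` alias and both function names are hypothetical, and the sketch assumes that a trim point of `t` means records up to and including `t` have been removed from the log.

```rust
/// Hypothetical stand-in for a log sequence number; the real type differs.
type Lsn = u64;

/// Archived LSN under explicit retention: the LSN of the earliest
/// retained snapshot. Returns `None` when nothing is retained.
fn archived_lsn(retained_snapshot_lsns: &[Lsn]) -> Option<Lsn> {
    retained_snapshot_lsns.iter().copied().min()
}

/// A snapshot can bootstrap a partition as long as the log has not been
/// trimmed past it. Assumption: records up to and including `trim_point`
/// are gone, so replay must be able to start right after `snapshot_lsn`.
fn can_bootstrap_from(snapshot_lsn: Lsn, trim_point: Lsn) -> bool {
    snapshot_lsn >= trim_point
}
```

With retained snapshots at LSNs 1200, 2500 and 3900, the archived LSN is 1200. If auto trim never goes beyond that, all three snapshots stay usable for bootstrap.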

This change builds on #3918.

Sample minimal configuration file:

[worker]
durability-mode = "snapshot-only"       # use archived LSN as the safe log trim position source for testing on single nodes

[worker.snapshots]
destination = "s3://restate/snapshots"
snapshot-interval-num-records = 1000    # min records per snapshot
snapshot-interval = "5 min"
experimental-retain-snapshots = 10

This configuration means:

create a new snapshot every 5 min, but only if at least 1000 new records have been applied since the previous snapshot
retain the latest 10 snapshots, and consider the earliest of these as the archived LSN
automatically delete earlier snapshots from the object store
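
The three rules above can be sketched roughly as follows. This is an illustrative model only, not the implementation in this PR: `SnapshotPolicy`, `should_snapshot` and `apply_retention` are hypothetical names, and snapshots are represented by nothing more than their LSNs.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a log sequence number.
type Lsn = u64;

/// Hypothetical mirror of the three options in the sample configuration.
struct SnapshotPolicy {
    interval: Duration, // snapshot-interval
    min_records: u64,   // snapshot-interval-num-records
    retain: usize,      // experimental-retain-snapshots
}

/// A new snapshot is due only when both the time interval has elapsed
/// and enough records have been applied since the previous snapshot.
fn should_snapshot(policy: &SnapshotPolicy, elapsed: Duration, records_since_last: u64) -> bool {
    elapsed >= policy.interval && records_since_last >= policy.min_records
}

/// Splits snapshot LSNs into `(retained, to_delete)`, keeping the `retain`
/// most recent ones. Both vectors come back in ascending LSN order, so
/// `retained.first()` corresponds to the archived LSN.
fn apply_retention(mut lsns: Vec<Lsn>, retain: usize) -> (Vec<Lsn>, Vec<Lsn>) {
    lsns.sort_unstable();
    let split = lsns.len().saturating_sub(retain);
    let retained = lsns.split_off(split);
    (retained, lsns)
}
```

For example, with 12 snapshots and `retain = 10`, the two oldest snapshots fall into the delete set and the third-oldest becomes the archived LSN.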

github-actions · 2025-11-03T13:15:15Z

Test Results

7 files ±0 7 suites ±0 3m 25s ⏱️ +48s
47 tests ±0 47 ✅ ±0 0 💤 ±0 0 ❌ ±0
200 runs ±0 200 ✅ ±0 0 💤 ±0 0 ❌ ±0

Results for commit afba6c6. ± Comparison against base commit ba47e39.

♻️ This comment has been updated with latest results.

tillrohrmann

Thanks for creating this PR @pcholakov. I think it will be a great improvement for our users to no longer have to manage the snapshots themselves.

Before diving into the details, what was the motivation to explicitly keep track of deletions and retained snapshots from the perspective of the latest snapshot? The extra bookkeeping adds a bit of complexity that might not be necessary if we had a simple snapshot cleaner that periodically lists the snapshot repository and deletes everything except for the latest retained snapshots. Did you want to save S3 get calls? Would there be a problem with reporting the archived lsn (or the lsn that no snapshot refers to anymore)? Or is this a preparation for things that will become necessary once we add support for incremental snapshots? Or is the idea that in the future there can be different retention policies, which makes us want to track exactly which snapshots to retain and which ones to delete?

tillrohrmann · 2025-11-13T22:25:06Z

crates/partition-store/src/snapshots/snapshot_task.rs

+                        snapshot_lsn = %metadata.min_applied_lsn,
+                        archived_lsn = %archived_lsn.get_archived_lsn(),


What's the difference between snapshot lsn and archived lsn?

Ok, I assume that the latter is the snapshot lsn of the earliest retained snapshot.

tillrohrmann · 2025-11-13T22:27:34Z

crates/worker/src/partition/leadership/durability_tracker.rs

-/// A stream that tracks the last reported durable Lsn, replica-set durable points, and
-/// last archived lsn and emits a [`PartitionDurability`] when the durable Lsn changes.
+/// A stream that tracks the last reported durable Lsn, replica-set durable points, and archived LSN
+/// (from snapshot repository), and emits a [`PartitionDurability`] when the durable Lsn changes.


tillrohrmann · 2025-11-13T22:32:24Z

crates/worker/src/partition_processor_manager.rs


    pending_snapshots: HashMap<PartitionId, PendingSnapshotTask>,
-    latest_snapshots: HashMap<PartitionId, SnapshotCreated>,
+    latest_snapshots: HashMap<PartitionId, LatestSnapshot>, // NB: latest snapshot min LSN != archived LSN, necessarily


The comment only holds true if num retained snapshots > 1, right? And even then, snapshots might have the same lsn (admittedly this is a corner case).

tillrohrmann · 2025-11-13T22:46:46Z

crates/worker/src/partition_processor_manager.rs


    pending_snapshots: HashMap<PartitionId, PendingSnapshotTask>,
-    latest_snapshots: HashMap<PartitionId, SnapshotCreated>,
+    latest_snapshots: HashMap<PartitionId, LatestSnapshot>, // NB: latest snapshot min LSN != archived LSN, necessarily


What's the difference between latest_snapshots and archived_lsns? The latter seems to contain a subset of the information of the former. Could this be unified?

tillrohrmann · 2025-11-14T09:25:10Z