Tombsweeper: clean stale delete markers #89

Merged
merged 9 commits into from
May 23, 2025

Conversation

wojas
Member

@wojas wojas commented May 16, 2025

The 'tombsweeper' ('sweeper' for short) cleans old, stale delete markers from the LMDB.

Closes #81

WARNING: This is DISABLED by default, because enabling this functionality requires careful consideration. When enabled, you MUST make sure that no instance that has been offline for longer than the sweeper retention_days will ever reconnect. If this does happen, old entries that have been deleted may be resurrected, causing anything from undesired results to database state corruption. Be especially careful about development or testing systems that only come online occasionally.

For small personal installations with infrequent changes, it is probably better NOT to enable this functionality, and to accept that deleted entries accumulate over time. Only consider this if you have frequent updates and deletion markers would actually become a problem over time. In that case, be very conservative with the retention_days setting: the longer, the better.
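
The reconnect rule in the warning can be sketched as a pre-flight check an operator might run before bringing an old instance back online. This is only an illustration: `safe_to_reconnect` and the `last_synced` timestamp are hypothetical names, not part of this PR, and the operator would have to track the last successful sync time themselves.

```python
from datetime import datetime, timedelta, timezone


def safe_to_reconnect(last_synced: datetime, now: datetime, retention_days: float) -> bool:
    """Return True if the instance was out of sync for less than the retention window.

    Hypothetical helper: if this returns False, the instance may still hold
    entries whose delete markers have already been swept elsewhere.
    """
    return now - last_synced < timedelta(days=retention_days)


now = datetime(2025, 5, 23, tzinfo=timezone.utc)
# Offline for about 142 days with the default 370-day retention.
print(safe_to_reconnect(datetime(2025, 1, 1, tzinfo=timezone.utc), now, 370))  # True
# Offline for about 508 days: longer than the retention window.
print(safe_to_reconnect(datetime(2024, 1, 1, tzinfo=timezone.utc), now, 370))  # False
```

An instance that fails such a check would presumably need to be wiped and resynced rather than reconnected with its old LMDB.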

Progress

  • Working Sweeper with incremental sweep support
  • Configuration
  • Filter stale entries when loading snapshots
  • Add metrics to track deleted entries

Likely later in a different PR:

  • Document this part of LS

New YAML example config section:

# Sweeper settings for the LMDB sweeper that removes deleted entries after
# a while, also known as the "tomb sweeper".
#
# The key consideration for these settings is how long an instance can be
# expected to be disconnected from the storage (out of sync) before
# rejoining. If the retention interval is set too low, old records that
# have been removed during the downtime can reappear, which can cause
# major issues.
#
# When picking a value, also take into account development, testing and
# migration systems that only occasionally come online.
#
sweeper:
  # Enabled controls if the sweeper is enabled.
  # It is DISABLED by default, because of the important consistency
  # considerations that depend on the kind of deployment.
  # When disabled, the deleted entries will never actually be removed.
  # Stats are only available when the sweeper is enabled.
  #enabled: false

  # RetentionDays is the retention period in DAYS. Unlike most other
  # settings, this is specified as a number of days instead of a Duration,
  # because retention periods are expected to be long.
  # This is a float, so it is possible to use periods shorter than one day,
  # but this is rarely a good idea. It is best to set this as high as possible.
  # Default: 370 (days, intentionally on the safe side)
  #retention_days: 370

  # Interval is the interval between sweeps of the whole database to enforce
  # RetentionDays.
  # As a guideline, on a fast server sweeping 1 million records takes
  # about 1 second.
  # Default: 6h
  #interval: 6h

  # FirstInterval is the delay before the first sweep after startup.
  # It allows a sweep to run soon after extended downtime, without waiting
  # for a full Interval.
  # Default: 10m
  #first_interval: 10m

  # LockDuration limits how long the sweeper may hold the exclusive write
  # lock at one time. This effectively controls the maximum latency spike
  # due to the sweeper for API calls that update the LMDB.
  # This is not a hard quota, the sweeper may overrun it slightly.
  # Default: 50ms
  #lock_duration: 50ms

  # ReleaseDuration determines how long the sweeper must sleep before it
  # is allowed to reacquire the exclusive write lock.
  # If this is equal to LockDuration, it means that the sweeper can hold the
  # LMDB write lock at most half the time.
  # Do not set this too high, as every sweep cycle will record a write
  # transaction that can trigger a snapshot generation scan. It is best
  # to get it over with in a short total sweep time.
  # Default: 50ms
  #release_duration: 50ms
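
The interplay of lock_duration and release_duration described in the config comments can be sketched as a time-boxed batch loop. This is a minimal Python illustration under stated assumptions, not the PR's actual Go implementation: the dict-based `db`, the `deleted` and `ts` fields, and the injectable `clock` and `sleep` parameters are all hypothetical stand-ins.

```python
import time
from typing import Callable, Dict


def sweep(
    db: Dict[str, dict],
    cutoff: float,
    lock_duration: float = 0.050,
    release_duration: float = 0.050,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Remove delete markers older than `cutoff`, working in time-boxed batches."""
    keys = sorted(db)  # fixed key order; a real LMDB cursor would resume by key instead
    cleaned = 0
    i = 0
    while i < len(keys):
        # Conceptually: acquire the write lock, then work until the budget is spent.
        deadline = clock() + lock_duration
        while i < len(keys):
            entry = db.get(keys[i])
            if entry is not None and entry["deleted"] and entry["ts"] < cutoff:
                del db[keys[i]]
                cleaned += 1
            i += 1
            if clock() >= deadline:
                break  # soft limit: checked after each entry, so a slight overrun is possible
        if i < len(keys):
            # Conceptually: release the lock and let other writers proceed.
            sleep(release_duration)
    return cleaned


db = {
    "a": {"deleted": True, "ts": 1.0},    # stale marker, removed
    "b": {"deleted": False, "ts": 1.0},   # live entry, kept
    "c": {"deleted": True, "ts": 100.0},  # recent marker, kept
}
print(sweep(db, cutoff=50.0), sorted(db))
```

With equal lock and release durations the loop spends at most about half of the wall-clock time holding the lock, which matches the "at most half the time" note above. Every batch processes at least one entry, so the sweep always makes progress even with a very small lock_duration.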

New metrics:

# HELP lightningstream_sweeper_cleaned_total Number of stale deletion markers cleaned by the sweeper
# TYPE lightningstream_sweeper_cleaned_total counter
lightningstream_sweeper_cleaned_total{lmdb="main"} 0
lightningstream_sweeper_cleaned_total{lmdb="shard"} 0
# HELP lightningstream_sweeper_duration_seconds Summary of time taken by sweeper
# TYPE lightningstream_sweeper_duration_seconds summary
lightningstream_sweeper_duration_seconds_sum{lmdb="main"} 0.000113072
lightningstream_sweeper_duration_seconds_count{lmdb="main"} 1
lightningstream_sweeper_duration_seconds_sum{lmdb="shard"} 4.4089e-05
lightningstream_sweeper_duration_seconds_count{lmdb="shard"} 1
# HELP lightningstream_sweeper_stats_available Set to 1 when the stats are available after a sweep run
# TYPE lightningstream_sweeper_stats_available gauge
lightningstream_sweeper_stats_available{lmdb="main"} 1
lightningstream_sweeper_stats_available{lmdb="shard"} 1
# HELP lightningstream_sweeper_stats_deleted_entries Deleted entries after last sweeper run
# TYPE lightningstream_sweeper_stats_deleted_entries gauge
lightningstream_sweeper_stats_deleted_entries{lmdb="main"} 0
lightningstream_sweeper_stats_deleted_entries{lmdb="shard"} 0
# HELP lightningstream_sweeper_stats_total_entries Total entries after last sweeper run, including deleted
# TYPE lightningstream_sweeper_stats_total_entries gauge
lightningstream_sweeper_stats_total_entries{lmdb="main"} 9
lightningstream_sweeper_stats_total_entries{lmdb="shard"} 5
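
As a rough sketch of how these metrics could be consumed, the snippet below parses the text exposition shown above and derives the fraction of entries that are delete markers plus the mean sweep duration per LMDB. The helper names (`parse_metrics`, `deleted_fraction`, `mean_sweep_seconds`) are hypothetical, and the parser only handles the single `lmdb` label used in this output; a real setup would more likely compute this in the monitoring system itself.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{lmdb="main"} 1.5e-05
SAMPLE = re.compile(r'^(\w+)\{lmdb="([^"]+)"\}\s+(\S+)$')


def parse_metrics(text):
    """Return {lmdb_name: {metric_name: value}} for the sample lines in `text`."""
    out = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE.match(line)
        if m:
            name, lmdb, value = m.groups()
            out[lmdb][name] = float(value)
    return dict(out)


def deleted_fraction(metrics):
    """Fraction of entries that are delete markers, or None if stats are not available."""
    if metrics.get("lightningstream_sweeper_stats_available") != 1:
        return None
    total = metrics["lightningstream_sweeper_stats_total_entries"]
    deleted = metrics["lightningstream_sweeper_stats_deleted_entries"]
    return deleted / total if total else 0.0


def mean_sweep_seconds(metrics):
    """Average sweep duration from the summary's _sum and _count series."""
    count = metrics.get("lightningstream_sweeper_duration_seconds_count", 0)
    if not count:
        return None
    return metrics["lightningstream_sweeper_duration_seconds_sum"] / count


# Values taken from the "main" LMDB in the output above.
text = '''
lightningstream_sweeper_duration_seconds_sum{lmdb="main"} 0.000113072
lightningstream_sweeper_duration_seconds_count{lmdb="main"} 1
lightningstream_sweeper_stats_available{lmdb="main"} 1
lightningstream_sweeper_stats_deleted_entries{lmdb="main"} 0
lightningstream_sweeper_stats_total_entries{lmdb="main"} 9
'''
main = parse_metrics(text)["main"]
print(deleted_fraction(main), mean_sweep_seconds(main))
```

A steadily growing deleted fraction on a deployment with the sweeper still disabled would be one signal that enabling it is worth considering.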

wojas added 5 commits May 14, 2025 21:50
Add a Sweeper that cleans stale deleted entries according to a
configuration. The individual runs are constrained by a time limit that
the write lock is held.

This is not actually called yet when running the syncer.
Actually run the sweeper from sync, and enable in the devenv.

Fix incorrect scheduling and circular import.
When the sweeper is enabled, we need to filter out stale deletion
markers when loading snapshots to not resurrect zombies.
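
The snapshot filtering described in the last commit message can be sketched as follows. This is a hedged Python illustration, not the PR's Go code: the `Entry` layout and the `drop_stale_markers` name are hypothetical, and the real snapshot format may track deletion and timestamps differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """Hypothetical snapshot entry; field names are assumptions for illustration."""
    key: bytes
    value: bytes
    deleted: bool
    timestamp: datetime


def drop_stale_markers(
    entries: Iterable[Entry], now: datetime, retention_days: float
) -> Iterator[Entry]:
    """Yield all entries except delete markers older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    for e in entries:
        if e.deleted and e.timestamp < cutoff:
            continue  # the sweeper would remove this marker anyway, so do not load it
        yield e


now = datetime(2025, 5, 23, tzinfo=timezone.utc)
snapshot = [
    Entry(b"live", b"v", False, now - timedelta(days=500)),  # old but live: kept
    Entry(b"gone-old", b"", True, now - timedelta(days=400)),  # stale marker: dropped
    Entry(b"gone-new", b"", True, now - timedelta(days=10)),   # recent marker: kept
]
print([e.key for e in drop_stale_markers(snapshot, now, 370)])
```

Only delete markers are filtered here. Live entries are kept regardless of age, since the retention window applies to deletions and not to data.
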
@wojas wojas added this to the v0.5.0 milestone May 16, 2025
@wojas wojas changed the title [WIP] Tombsweeper: clean stale delete markers Tombsweeper: clean stale delete markers May 16, 2025
@nvaatstra
Contributor

Looks good to me (being aware of issue #90 as a response to the WARNING). Logic seems solid and defaults seem sensible.

@wojas wojas requested review from neilcook and nvaatstra May 23, 2025 09:24
Contributor

@nvaatstra nvaatstra left a comment

LGTM

@wojas wojas merged commit 4520e3d into main May 23, 2025
4 checks passed
@wojas wojas deleted the tombsweeper branch May 23, 2025 10:42
Development

Successfully merging this pull request may close these issues.

tombstone cleanup
2 participants