bottomless: less bugs more robustness #1685

Draft: wants to merge 15 commits into main

@sivukhin sivukhin commented Aug 21, 2024

Context

There are several known issues with the bottomless restore process:

  1. When S3 holds more than one page of data, bottomless always stopped after the first page because of incorrect use of the last_received_frame_no variable.
  2. bottomless relies on the last connection performing a checkpoint. This holds if the DB is valid, but if the DB is malformed the last connection just exits silently and leaves the DB empty (a 4KB DB file and some data in the WAL). The current implementation ignores this situation and restores an empty DB.
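
To make the first issue concrete, here is a minimal sketch of a restore loop that keeps going past the first page. It is not the bottomless code: `Page`, `restore_all_pages`, and the `fetch_page` callback are hypothetical stand-ins for a paginated S3 listing, and the faulty code itself is not shown in this PR. The point it illustrates is that the loop ends only when the listing reports no further page, while `last_received_frame_no` is carried across pages.

```rust
/// One page of a hypothetical paginated listing of frame ranges.
/// (Illustrative type, not part of bottomless.)
struct Page {
    /// Inclusive (first_frame_no, last_frame_no) ranges found on this page.
    ranges: Vec<(u32, u32)>,
    /// Token identifying the next page, or `None` on the final page.
    next_token: Option<usize>,
}

/// Walks every page of the listing and returns the highest frame number seen.
/// `fetch_page` stands in for the S3 list call; it receives the token of the
/// page to fetch (`None` for the first page).
fn restore_all_pages<F>(mut fetch_page: F) -> u32
where
    F: FnMut(Option<usize>) -> Page,
{
    // Declared outside the loop so the value survives page boundaries.
    let mut last_received_frame_no = 0u32;
    let mut token = None;
    loop {
        let page = fetch_page(token);
        for (_first, last) in page.ranges {
            last_received_frame_no = last_received_frame_no.max(last);
        }
        // Termination depends only on the continuation token, never on
        // the frame counter.
        match page.next_token {
            Some(next) => token = Some(next),
            None => break,
        }
    }
    last_received_frame_no
}
```

With two pages holding ranges up to frames 20 and 30, the function visits both pages and reports 30 rather than stopping at 20.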

Changes

  1. Fixed the bug in restoring from more than one page in S3.
  2. Added validation that no WAL files remain on disk after the last connection is dropped. Otherwise bottomless now fails the restore, because the DB was most probably malformed.
  3. Added a BOTTOMLESS CAUTION prefix to all cases where bottomless can behave suspiciously.
  4. Added a simple restore_from_partial_db test, which drops several files from S3 and checks that the DB can start from this partial backup.
    • It is not immediately obvious why we need to restore in such cases. But the server can crash at any point in time and we upload frame ranges in parallel, so a small suffix of frame ranges can legitimately have a gap. We therefore can't simply fail the restore process, because that would cause trouble in an "almost valid" scenario.
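
The PR text does not spell out how a gap in the suffix is handled. One plausible policy, sketched below under that assumption, is to apply only the contiguous prefix of frame ranges and ignore everything after the first gap. The function name `last_contiguous_frame` and the convention that frame numbers start at 1 are illustrative choices, not taken from bottomless.

```rust
/// Given inclusive (first_frame_no, last_frame_no) ranges found in a backup,
/// returns the last frame number that can be reached from frame 1 without
/// crossing a gap. Returns 0 if even the first range is missing.
/// (Hypothetical helper for illustration only.)
fn last_contiguous_frame(mut ranges: Vec<(u32, u32)>) -> u32 {
    // Parallel uploads may list ranges in any order, so sort first.
    ranges.sort();
    let mut last = 0u32;
    for (first, end) in ranges {
        if first > last.saturating_add(1) {
            // Gap: a range is missing, so everything from here on is
            // treated as unusable.
            break;
        }
        last = last.max(end);
    }
    last
}
```

For ranges 1..=10, 11..=20 and 31..=40, the range 21..=30 is missing, so only frames up to 20 would be applied and the restore still succeeds instead of failing outright.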

- now we rely on the last SQLite connection performing a checkpoint;
  this is fragile because if the DB + WAL are malformed somehow, SQLite will exit silently

- one way to resolve this issue is to trigger wal_checkpoint(TRUNCATE)
  manually, but this could potentially interfere with bottomless

- so a more robust way to resolve this issue was implemented: we just
  check that the WAL was transferred and the -wal + -shm files were deleted. If
  not, we abort the restore process
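
The check described above can be sketched roughly as follows. This is an illustration, not the code from the PR: `sidecar` and `ensure_wal_removed` are hypothetical names, and the error text is made up. It relies only on SQLite's convention of naming the sidecar files by appending `-wal` and `-shm` to the database file name.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Builds the path of a SQLite sidecar file by appending `suffix`
/// (for example "-wal") to the full database file name.
fn sidecar(db_path: &Path, suffix: &str) -> PathBuf {
    let mut name: OsString = db_path.as_os_str().to_owned();
    name.push(suffix);
    PathBuf::from(name)
}

/// Hypothetical post-restore check: after the last connection is closed,
/// neither the -wal nor the -shm file should remain next to the database.
/// If one does, the checkpoint most likely did not happen, so report an error
/// that the caller can use to abort the restore.
fn ensure_wal_removed(db_path: &Path) -> Result<(), String> {
    for suffix in ["-wal", "-shm"] {
        let path = sidecar(db_path, suffix);
        if path.exists() {
            // The message wording is an assumption; only the prefix is
            // mentioned in the PR description.
            return Err(format!(
                "BOTTOMLESS CAUTION: {} still exists after restore",
                path.display()
            ));
        }
    }
    Ok(())
}
```

A caller would run `ensure_wal_removed` on the restored database path once the restoring connection has been dropped and fail the restore on `Err`. Compared with forcing `wal_checkpoint(TRUNCATE)`, this only observes the file system and does not change what SQLite does.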
@sivukhin sivukhin changed the title Bottomless less bugs more robustness bottomless: less bugs more robustness Aug 21, 2024