test: wait_for_upload_queue_empty no longer works after #8550 #8715

Open

Description

This problem was uncovered when we started doing graceful shutdowns (#8655) in tests and benches. The symptom looks like "infinite layer flushes" that continue even after the test has ended.

Most likely this is fallout from #8550, which made the frozen layer flush loop call RemoteTimelineClient::wait_completion after each flush. This has silently broken wait_for_upload_queue_empty, which is now much more likely to see an empty queue while the next frozen layer is still being written out.
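To make the suspected race concrete, here is a toy model in Python. It is only a sketch under assumptions: `ToyTimeline`, its fields, and `flush_one_and_wait` are hypothetical stand-ins invented for illustration, LSNs are plain integers, and none of this is the pageserver's real API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTimeline:
    """Hypothetical stand-in for a timeline; not the real pageserver API."""

    # End LSNs of frozen in-memory layers that still await a flush.
    frozen_layers: list[int] = field(default_factory=list)
    # End LSNs of layers that were flushed to disk but not yet uploaded.
    upload_queue: list[int] = field(default_factory=list)
    remote_consistent_lsn: int = 0

    def flush_one_and_wait(self) -> None:
        """Flush one frozen layer, then drain the upload queue.

        This mimics the behaviour described above: a wait_completion-like
        step runs after every single flush.
        """
        end_lsn = self.frozen_layers.pop(0)
        self.upload_queue.append(end_lsn)
        while self.upload_queue:
            self.remote_consistent_lsn = self.upload_queue.pop(0)

    def upload_queue_empty(self) -> bool:
        return not self.upload_queue


def observe_after_first_flush() -> tuple[bool, int, int]:
    """Return (queue_empty, layers_still_frozen, remote_consistent_lsn)."""
    timeline = ToyTimeline(frozen_layers=[100, 200, 300])
    timeline.flush_one_and_wait()
    return (
        timeline.upload_queue_empty(),
        len(timeline.frozen_layers),
        timeline.remote_consistent_lsn,
    )
```

In this model, a check that only looks at the upload queue after the first flush sees it empty even though two frozen layers remain and remote_consistent_lsn is still behind the last layer's end LSN.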

As such, we cannot use wait_for_upload_queue_empty anymore. It should be replaced with something ("proper checkpoint") which takes in an Lsn, and waits for:

  • all in-memory layers to be flushed and uploaded together with index_part.json uploads
  • remote_consistent_lsn to additionally advance over any lsn gap

It should then be used together with flush_ep_to_pageserver, which supplies the Lsn to wait for (or do we need the lsn range?).
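A minimal sketch of what such an Lsn-based wait could look like on the test side, in Python. Everything here is an assumption for illustration: the helper name `wait_for_remote_consistent_lsn`, its parameters, and the use of plain integers for LSNs are hypothetical, and the caller is expected to pass in whatever callable reads the current remote_consistent_lsn.

```python
import time
from typing import Callable


def wait_for_remote_consistent_lsn(
    get_remote_consistent_lsn: Callable[[], int],
    target_lsn: int,
    timeout_s: float = 10.0,
    interval_s: float = 0.1,
) -> int:
    """Poll until remote_consistent_lsn reaches target_lsn.

    Hypothetical helper: `get_remote_consistent_lsn` is any callable that
    returns the current value, for example one that reads timeline details.
    Returns the observed value, or raises TimeoutError if the target is not
    reached before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = get_remote_consistent_lsn()
        if current >= target_lsn:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"remote_consistent_lsn {current} did not reach {target_lsn}"
            )
        time.sleep(interval_s)
```

Unlike a queue-emptiness check, this condition cannot be satisfied by a momentarily empty upload queue: it only returns once the uploaded state covers the requested Lsn. The target Lsn would be the one obtained from flush_ep_to_pageserver.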

Completion criteria:

  • wait_for_upload_queue_empty no longer exists
  • the regress suite converges on flush_ep_to_pageserver and the new mechanism described above
    • it must handle lsn gaps: after we have received the last_record_lsn, we can flush frozen layers, which will advance the lsn over the gap

Slack thread ref: https://neondb.slack.com/archives/C060CNA47S9/p1723564732056149?thread_ts=1723559868.379279&cid=C060CNA47S9


This might actually be fallout from #8550. That made the wait_for_upload_queue_empty check fail.

We should really get rid of all wait_for_upload_queue_empty calls and instead have a checkpointing mode where we provide the lsn (for example, the one received from flush_ep_to_pageserver) and the checkpoint waits until remote_consistent_lsn has reached it (meaning the uploads have completed).

Originally posted by @koivunej in #8712

Adding the bug label so we will triage this.


Labels

  • a/test (Area: related to testing)
  • c/storage/pageserver (Component: storage: pageserver)
  • triaged (bugs that were already triaged)