Description
opened on Aug 13, 2024
A problem uncovered by starting to do graceful shutdowns (#8655) in tests and benches. The symptom looks like "infinite layer flushes" even after the test has ended.
Most likely this is fallout from making the frozen layer flush loop call `RemoteTimelineClient::wait_completion` after each flush in #8550. This has silently broken `wait_for_upload_queue_empty`, which is now much more likely to see the queue being empty while the next frozen layer is still being written out.
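To illustrate the suspected race, here is a toy model (not pageserver code; `ToyTimeline` and all of its fields are hypothetical names made up for this sketch). Each flush writes one frozen layer, schedules its upload, and waits for the upload to complete, so the upload queue is empty between flushes even though frozen layers remain:

```python
from dataclasses import dataclass, field


@dataclass
class ToyTimeline:
    """Hypothetical stand-in for a timeline; LSNs are plain ints here."""

    frozen_layers: list[int]  # end LSNs of frozen layers not yet flushed
    upload_queue: list[int] = field(default_factory=list)
    remote_consistent_lsn: int = 0

    def flush_one(self) -> None:
        # Write out the oldest frozen layer and schedule its upload.
        end_lsn = self.frozen_layers.pop(0)
        self.upload_queue.append(end_lsn)

    def wait_completion(self) -> None:
        # Drain the upload queue, advancing remote_consistent_lsn.
        while self.upload_queue:
            self.remote_consistent_lsn = self.upload_queue.pop(0)

    def upload_queue_empty(self) -> bool:
        return not self.upload_queue


tl = ToyTimeline(frozen_layers=[10, 20, 30])
tl.flush_one()
tl.wait_completion()

# The queue looks empty, but two frozen layers are still pending,
# so a "queue is empty" check would return too early.
queue_empty = tl.upload_queue_empty()
pending = list(tl.frozen_layers)
```

In this model a queue-empty check passes after the first flush while `remote_consistent_lsn` is still behind the last frozen layer.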
As such, we cannot use `wait_for_upload_queue_empty` anymore. It should be replaced with something (a "proper checkpoint") which takes in an Lsn and waits for:
- all in-memory layers to be flushed and uploaded, together with the `index_part.json` uploads
- additionally, a `remote_consistent_lsn` increase over any lsn gap

It should then be used with `flush_ep_to_pageserver` to get an Lsn (or do we need the lsn range?).
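A minimal sketch of what the test-side wait could look like, assuming the helper is given some way to read the current `remote_consistent_lsn`. The function name, its parameters, and the use of plain ints for LSNs are all assumptions for illustration, not an existing fixture API:

```python
import time
from typing import Callable


def wait_for_remote_consistent_lsn(
    get_remote_consistent_lsn: Callable[[], int],  # hypothetical getter
    target_lsn: int,
    timeout: float = 10.0,
    interval: float = 0.1,
) -> int:
    """Poll until remote_consistent_lsn reaches target_lsn, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        current = get_remote_consistent_lsn()
        if current >= target_lsn:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"remote_consistent_lsn {current} did not reach {target_lsn}"
            )
        time.sleep(interval)


# Usage with a fake getter that advances on every poll.
observed = iter([5, 10, 30])
reached = wait_for_remote_consistent_lsn(
    lambda: next(observed), target_lsn=25, interval=0.0
)
```

Unlike the queue-empty check, this condition cannot be satisfied in the window between two flushes, because it compares against a specific Lsn rather than against the momentary state of the queue.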
Completion criteria:
- `wait_for_upload_queue_empty` will no longer exist
- the regress suite converges to `flush_ep_to_pageserver` and the new thing from above
- we need to make it handle lsn gaps, so that after we've received the `last_record_lsn` we can flush frozen layers, which will advance the lsn over the gap
Slack thread ref: https://neondb.slack.com/archives/C060CNA47S9/p1723564732056149?thread_ts=1723559868.379279&cid=C060CNA47S9
This might actually be fallout from #8550, which made the `wait_for_upload_queue_empty` check fail.
We should really get rid of all `wait_for_upload_queue_empty` calls and instead have a checkpointing mode where we provide the lsn (for example, the one received from `flush_ep_to_pageserver`) and the checkpoint waits until `remote_consistent_lsn` has reached that lsn (uploads have completed).
Originally posted by @koivunej in #8712
Adding the bug label so we will triage this.