Libsql-wal: streaming compaction #1762

MarinPostma · 2024-09-30T11:03:41Z

This PR implements streaming compaction for libsql-wal.

Motivation

Before, we used buffer files for segment compaction. Compacting long sequences of segments needed a lot of disk space to store both the intermediate segments and the resulting segments. With some changes to the compacted segment format, this PR enables streaming compaction of segments.

How?

Initially, CompactedSegments contained a header with the number of frames in the segment. When compacting, we can't cheaply know how many frames will be in the resulting segment, so we needed a different way to know how many frames there are in the segment. The segment also contained a footer with the checksum, but without the frame count it's impossible to know where to fetch the footer. Instead, we change the frame headers, introducing a CompactedFrameHeader. The compacted frame headers drop the size_after field (all frames in a compacted segment should be logically committed together), and introduce a checksum field and a flag field with the LAST flag. The LAST flag is set for the last frame in the segment.

The checksum is computed as the crc32 of the frame header + data (except the checksum field itself), seeded by the checksum of the previous frame. The first frame is seeded with the checksum of the segment header.
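To make the chaining concrete, here is a minimal sketch of how a writer could seal frames and how a reader could validate them without knowing the frame count up front. The field layout, the little-endian encoding, the FLAG_LAST value, the hand-rolled CRC-32 (IEEE polynomial), and the helper names seal_frames and verify_frames are all assumptions made for illustration; they are not taken from the libsql-wal code.

```rust
/// Hypothetical flag bit marking the final frame of a compacted segment.
const FLAG_LAST: u32 = 1;

/// Bitwise CRC-32 (IEEE polynomial) that continues from `seed`.
/// Stand-in for whatever crc32 implementation the crate actually uses.
fn crc32_update(seed: u32, data: &[u8]) -> u32 {
    let mut crc = !seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: no size_after, but a flags and a checksum field.
struct CompactedFrameHeader {
    page_no: u32,
    frame_no: u64,
    flags: u32,
    checksum: u32,
}

impl CompactedFrameHeader {
    /// Header bytes covered by the checksum: everything except `checksum`.
    fn checksummed_bytes(&self) -> [u8; 16] {
        let mut out = [0u8; 16];
        out[0..4].copy_from_slice(&self.page_no.to_le_bytes());
        out[4..12].copy_from_slice(&self.frame_no.to_le_bytes());
        out[12..16].copy_from_slice(&self.flags.to_le_bytes());
        out
    }
}

/// crc32(header without checksum + page data), seeded by the previous checksum.
fn frame_checksum(prev: u32, header: &CompactedFrameHeader, page: &[u8]) -> u32 {
    let crc = crc32_update(prev, &header.checksummed_bytes());
    crc32_update(crc, page)
}

/// Writer side: set LAST on the final frame and chain the checksums.
/// Returns the checksum of the last frame.
fn seal_frames(segment_header_checksum: u32, frames: &mut [(CompactedFrameHeader, Vec<u8>)]) -> u32 {
    let last = frames.len().saturating_sub(1);
    let mut prev = segment_header_checksum;
    for (i, (header, page)) in frames.iter_mut().enumerate() {
        header.flags = if i == last { FLAG_LAST } else { 0 };
        header.checksum = frame_checksum(prev, &*header, page.as_slice());
        prev = header.checksum;
    }
    prev
}

/// Reader side: walk the frames, check the chain, and expect LAST only at the end.
fn verify_frames(segment_header_checksum: u32, frames: &[(CompactedFrameHeader, Vec<u8>)]) -> bool {
    let mut prev = segment_header_checksum;
    for (i, (header, page)) in frames.iter().enumerate() {
        if frame_checksum(prev, header, page) != header.checksum {
            return false;
        }
        let is_last = header.flags & FLAG_LAST != 0;
        if is_last != (i + 1 == frames.len()) {
            return false;
        }
        prev = header.checksum;
    }
    !frames.is_empty()
}
```

Because each checksum is seeded by the previous one, a corrupted or reordered frame invalidates every frame after it, and the LAST flag tells the reader where the segment ends, so neither a frame count in the header nor a footer is needed.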

The dedup_stream method in the compactor is the meat of this PR. It takes a SegmentSet and returns a deduplicated set of all the frames for that set. Here's how it works:
Iterating over the segments in the set backwards (most recent segment first), we start downloading their indexes (this step is done concurrently). Then, we sequentially iterate over the received indexes and check whether each segment contains any data that we need. To do this, we maintain a seen_pages bitset with all the pages we have already collected. If any page in the segment index is not in the set, we download the segment data. For every frame in the segment data whose page we haven't seen, we stream that page out. We repeat this process until we either have enough pages (as indicated by size_after) or run out of segments to search.
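The loop described above can be sketched roughly as follows. This is a simplified, synchronous, in-memory model: Frame and Segment are hypothetical stand-ins for the real types, the concurrent index download is reduced to scanning the page numbers of each segment, a HashSet stands in for the seen_pages bitset, and the emit callback plays the role of the output stream. The sketch also assumes that later frames within a segment supersede earlier ones and that pages beyond the final size_after are not needed.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a frame: a page number and the page contents.
struct Frame {
    page_no: u32,
    data: Vec<u8>,
}

/// Hypothetical stand-in for a segment; the page numbers of `frames` act as its index.
struct Segment {
    /// Database size in pages after this segment is applied.
    size_after: u32,
    frames: Vec<Frame>,
}

/// A page is still needed if it is in range and has not been collected yet.
fn is_needed(seen: &HashSet<u32>, size_after: u32, page_no: u32) -> bool {
    page_no <= size_after && !seen.contains(&page_no)
}

/// `segments` is ordered oldest first; frames are emitted newest first.
fn dedup_stream(segments: &[Segment], mut emit: impl FnMut(&Frame)) {
    let Some(newest) = segments.last() else { return };
    let size_after = newest.size_after;
    let mut seen_pages: HashSet<u32> = HashSet::new();

    // Most recent segment first, so the newest version of each page wins.
    for segment in segments.iter().rev() {
        // Stop once every page of the final database has been collected.
        if seen_pages.len() as u32 >= size_after {
            break;
        }
        // "Index" check: skip the segment data if it holds nothing new.
        let has_new = segment
            .frames
            .iter()
            .any(|f| is_needed(&seen_pages, size_after, f.page_no));
        if !has_new {
            continue;
        }
        // Stream out every frame whose page we haven't seen yet.
        for frame in segment.frames.iter().rev() {
            if is_needed(&seen_pages, size_after, frame.page_no) {
                seen_pages.insert(frame.page_no);
                emit(frame);
            }
        }
    }
}
```

In this model nothing is buffered on disk: each needed frame is handed to emit as soon as it is found, and segments whose pages were all superseded by newer segments are never read beyond their index.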

MarinPostma added 2 commits September 30, 2024 12:25

cleanup wal toolkit

726c40c

make MapSlice more general

912aaf3

MarinPostma enabled auto-merge September 30, 2024 11:04

streaming compaction

06e177f

MarinPostma force-pushed the streaming-compaction branch from 7ec81fc to 06e177f Compare September 30, 2024 11:22

MarinPostma added this pull request to the merge queue Sep 30, 2024

Merged via the queue into main with commit 8abff7b Sep 30, 2024
18 checks passed

MarinPostma deleted the streaming-compaction branch September 30, 2024 12:08
