Compaction to cut output file if it can avoid to overlap a grandparent file #1963
Conversation
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@yoshinorim I'm hoping this can help reduce write amplification in the use case you are working on.
Here is a friendly benchmark scenario: time-series inserts with 10 time series.
The command:
@siying updated the pull request - view changes - changes since last import
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying updated the pull request - view changes - changes since last import
…t file

Summary: By cutting compaction output files in level N, we can sometimes prevent one file in level N+1 from overlapping with files in level N.
Consider the following case.
Level N-1: [1, 21]
Level N: [3, 23]
Level N+1: [2, 4] [11, 15] [22, 24]
(each [] represents a file and each number represents a key in the file)
If we compact the files in N-1 and N, before this commit we would generate one single file in level N:
[1, 3, 21, 23]
However, this overlaps with all three files on level N+1. With this commit, we cut the file into two:
[1, 3] [21, 23]
so that the file [11, 15] will not overlap with any file in level N. This reduces some write amplification when compacting N => N+1.

Test Plan: Add test cases in compaction_job_test to cover this case. Also add a test case to make sure the normal max_compaction_bytes limit is still satisfied.
@siying updated the pull request - view changes - changes since last import
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying After running db_bench, was the number of sst files noticeably changed with your diff (were many smaller sst files created with your diff)?
@yoshinorim for this benchmark, it didn't. I am worried about some extreme cases where many small files may be created, and I am thinking of ways to mitigate the risk.
while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) >
           0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
        grandparents[grandparent_index]->fd.GetFileSize() >
            compaction->max_output_file_size() / 8) {
Maybe I am missing something, but should not this condition
grandparents[grandparent_index]->fd.GetFileSize() > compaction->max_output_file_size() / 8
be true most of the time?
Does that mean that we limit the number of grandparents to a maximum of 2 files?
grandparent_index_moved is set to 0 in line 174, so we only skip it when the grandparent index moves twice as we add a key. The index can only move twice if the next file lies entirely between the current key and the previous key.
Will setting a small number for max_compaction_bytes give us the same effect?
@IslamAbdelRahman setting a small max_compaction_bytes will make files always be cut at the boundaries of grandparent files. This will always generate very small output files.
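To make that contrast concrete, here is a minimal sketch (not the actual RocksDB code) of the two cut criteria discussed above: with max_compaction_bytes, the output is cut as soon as the accumulated grandparent overlap exceeds the limit, no matter how small the current output file is, while the heuristic quoted above cuts only when the grandparent file being passed is meaningful in size (here, larger than 1/8 of the target output file size). All function names are illustrative.

```cpp
#include <cstdint>

// Existing rule: limit the total overlap with the grandparent level.
bool CutByMaxCompactionBytes(uint64_t overlapped_grandparent_bytes,
                             uint64_t max_compaction_bytes) {
  return overlapped_grandparent_bytes > max_compaction_bytes;
}

// PR heuristic: cut only if skipping a grandparent file big enough that
// excluding it from the future N -> N+1 compaction is worth a smaller output.
bool CutToSkipGrandparentFile(uint64_t skipped_grandparent_file_size,
                              uint64_t max_output_file_size) {
  return skipped_grandparent_file_size > max_output_file_size / 8;
}
```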
Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.
I think it's still an idea worth discussing. CC @ajkr
Great idea to exploit ShouldStopBefore()'s key-by-key consideration to cut files to minimize overlap!
Maybe add a TODO to take range tombstones into account?
@@ -171,12 +171,18 @@ struct CompactionJob::SubcompactionState {
  const std::vector<FileMetaData*>& grandparents = compaction->grandparents();
Not necessary to address for this PR - I feel like this function should operate at the granularity of atomic compaction unit rather than individual files. Particularly now that we cut to avoid overlapping a unit (currently a file), it is only beneficial if that unit also happens to be an atomic compaction unit, meaning it would not get pulled into its neighbors' compactions.
while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) >
           0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
I can't really figure out the condition. The size of the file at the initial value of grandparent_index seems like it shouldn't matter, since we've already overlapped that file and are now considering whether to overlap the next file.
How about something like this:
- a copy of the grandparent_index at the start of the function, call it prev_grandparent_index
- a flag, nontrivial_file_skipped, which indicates if we've skipped past a file of meaningful size whose index comes after prev_grandparent_index
Or we can do the first bullet point and then have a counter, file_data_skipped, which counts the cumulative size of all files that were skipped over. Then we can base the decision to cut the file on the total data that can be saved from future compaction, rather than whether there was a single file that was big enough.
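A rough sketch of that alternative follows. The names prev_grandparent_index and file_data_skipped come from the two comments above; everything else (the struct, min_bytes_saved_to_cut, and the exact loop bounds) is a hypothetical illustration, not the actual patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GrandparentFile {
  uint64_t file_size;
};

// Decide whether to cut the current output file, based on the total size of
// grandparent files skipped past since the function was entered.
bool ShouldCutForSkippedFiles(const std::vector<GrandparentFile>& grandparents,
                              std::size_t prev_grandparent_index,  // index at entry
                              std::size_t grandparent_index,       // index after advancing
                              uint64_t min_bytes_saved_to_cut) {
  uint64_t file_data_skipped = 0;
  // Accumulate the sizes of files whose index comes after
  // prev_grandparent_index and that the key range has now skipped entirely.
  for (std::size_t i = prev_grandparent_index + 1;
       i < grandparent_index && i < grandparents.size(); ++i) {
    file_data_skipped += grandparents[i].file_size;
  }
  // Cut only if the data we can keep out of the next compaction is meaningful.
  return file_data_skipped >= min_bytes_saved_to_cut;
}
```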
Can we write down what the extreme case is? I forget what we discussed last time, but the one that comes to mind now is:
Besides the possible solution we discussed before (pulling in adjacent tiny files in the output level), another possible solution is recompacting tiny files in a later compaction. It might work to first prioritize trivially moving files as far down as possible, then have a new type of "file size" compaction that compacts groups of adjacent small files within the same level (e.g., smaller than target size / 2). If it's in a non-bottommost level, there should be an additional restriction that such a compaction does not include files that overlap multiple files in the parent level. This "file size" compaction could have a similar priority to marked files, intra-bottommost, etc. (i.e., score 1.0). The downside is that the tiny-file count can grow continuously during a write burst with a particular pattern. edit: This solution could also help the problem described in #8306.
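As a hypothetical sketch of that "file size" compaction picking idea: within one level, collect runs of adjacent files smaller than half the target file size, skipping (in non-bottommost levels) any file that overlaps more than one file in the parent level. The types, names, and the run-length threshold below are illustrative assumptions, not part of RocksDB's compaction picker.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct LevelFile {
  uint64_t file_size;
  int parent_overlap_count;  // number of files it overlaps in the parent level
};

// Return [begin, end) index ranges of adjacent tiny files worth recompacting.
std::vector<std::pair<std::size_t, std::size_t>> PickSmallFileRuns(
    const std::vector<LevelFile>& level_files, uint64_t target_file_size,
    bool bottommost_level) {
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  auto eligible = [&](const LevelFile& f) {
    // A candidate is tiny and, in non-bottommost levels, does not overlap
    // more than one file in the parent level.
    return f.file_size < target_file_size / 2 &&
           (bottommost_level || f.parent_overlap_count <= 1);
  };
  std::size_t begin = 0;
  for (std::size_t i = 0; i <= level_files.size(); ++i) {
    if (i == level_files.size() || !eligible(level_files[i])) {
      if (i - begin >= 2) {  // only worth compacting a run of 2+ tiny files
        runs.emplace_back(begin, i);
      }
      begin = i + 1;
    }
  }
  return runs;
}
```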
Summary:
Try to align the compaction output file boundaries to the next level's (grandparent level) boundaries, to reduce level compaction write-amplification.

In level compaction, there is "wasted" data at the beginning and end of the output level files. Aligning the file boundaries can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to the next level's boundaries. It may cut a file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).

As a side effect, with this PR the SST file size can exceed target_file_size, but is capped at 2x target_file_size, and there will be smaller files. Here are file size statistics when loading 100GB with a target file size of 32MB:
```
      baseline      this_PR
count 1.656000e+03  1.705000e+03
mean  3.116062e+07  3.028076e+07
std   7.145242e+06  8.046139e+06
```
The feature is enabled by default; to revert to the old behavior, disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.

Also includes #1963 to cut files before a skippable grandparent file, which is for use cases like a user adding 2 or more non-overlapping data ranges at the same time; it can reduce the overlapping of the 2 datasets in the lower levels.

Pull Request resolved: #10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
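For reference, a minimal sketch of the cut rule that summary describes: an output file may be cut at a grandparent (next level) file boundary once it has reached at least half of target_file_size, and is always cut once it reaches twice target_file_size. The function and parameter names are illustrative assumptions, not the actual #10655 code.

```cpp
#include <cstdint>

bool ShouldCutAtGrandparentBoundary(uint64_t current_output_size,
                                    uint64_t target_file_size,
                                    bool at_grandparent_boundary) {
  if (current_output_size >= 2 * target_file_size) {
    return true;  // hard cap: never grow beyond 2x the target file size
  }
  // Align to the next-level boundary only when the output is already big
  // enough (>= 50% of target) that cutting will not leave a tiny file.
  return at_grandparent_boundary && current_output_size >= target_file_size / 2;
}
```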
close?
Yeah, #10655 included this idea
Summary:
By cutting compaction output files in level N, we can sometimes prevent one file in level N+1 from overlapping with files in level N.
Consider the following case.
Level N-1: [1, 21]
Level N: [3, 23]
Level N+1: [2, 4] [11, 15] [22, 24]
(each [] represents a file and each number represents a key in the file)
If we compact the files in N-1 and N, before this commit we would generate one single file in level N:
[1, 3, 21, 23]
However, this overlaps with all three files on level N+1. With this commit, we cut the file into two:
[1, 3] [21, 23]
so that the file [11, 15] will not overlap with any file in level N. This reduces some write amplification when compacting N => N+1.
Test Plan: Add test cases in compaction_job_test to cover this case. Also add a test case to make sure the normal max_compaction_bytes limit is still satisfied.
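To make the example above concrete, here is a small self-contained toy program (not RocksDB code) that walks the output keys 1, 3, 21, 23 against the level N+1 files [2, 4], [11, 15], [22, 24]. When adding key 21, the grandparent index advances past two file boundaries, meaning the file [11, 15] lies entirely between the previous key (3) and the current key (21), so the output is cut into [1, 3] and [21, 23].

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // Level N+1 (grandparent) files as [smallest, largest] key ranges.
  const std::vector<std::pair<int, int>> grandparents = {
      {2, 4}, {11, 15}, {22, 24}};
  const std::vector<int> output_keys = {1, 3, 21, 23};

  std::size_t gp = 0;
  for (std::size_t i = 0; i < output_keys.size(); ++i) {
    std::size_t advanced = 0;
    // Advance past every grandparent file whose largest key is below this key.
    while (gp < grandparents.size() &&
           grandparents[gp].second < output_keys[i]) {
      ++gp;
      ++advanced;
    }
    // Advancing by two or more means at least one grandparent file was
    // skipped over entirely: cut the output file before this key.
    if (i > 0 && advanced >= 2) {
      std::printf("cut output before key %d to avoid overlapping a skipped "
                  "grandparent file\n",
                  output_keys[i]);
    }
  }
  return 0;
}
```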