Compaction to cut output file if it can avoid to overlap a grandparent file #1963
Conversation
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@yoshinorim I'm hoping this can help reduce write amplification in the use case you are working on.
Here is a friendly benchmark scenario: time-series inserts with 10 time series.
The command:
@siying updated the pull request - view changes - changes since last import
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying updated the pull request - view changes - changes since last import
…t file

Summary: By cutting compaction output files in level N, we can sometimes prevent one file in level N+1 from overlapping with files in level N.
Consider the following case.
Level N-1: [1, 21]
Level N: [3, 23]
Level N+1: [2, 4] [11, 15] [22, 24]
(each [] represents a file and each number represents a key in the file)
If we compact the files in N-1 and N, before this commit we would generate one single file in level N:
[1, 3, 21, 23]
However, this overlaps with all three files on level N+1. With this commit, we cut the file into two:
[1, 3] [21, 23]
so that the file [11, 15] will not overlap with any file in level N. This reduces some write amplification when compacting N => N+1.

Test Plan: Add test cases in compaction_job_test to cover this case. Also add a test case to make sure the normal max_compaction_bytes limit is still satisfied.
@siying updated the pull request - view changes - changes since last import
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying After running db_bench, was the number of sst files noticeably changed with your diff (were many smaller sst files created with your diff)?
@yoshinorim for this benchmark, it didn't. I am worried about some extreme cases where many small files may be created, and I am thinking of ways to mitigate the risk.
while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) >
           0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
        grandparents[grandparent_index]->fd.GetFileSize() >
            compaction->max_output_file_size() / 8) {
Maybe I am missing something, but should not this condition
grandparents[grandparent_index]->fd.GetFileSize() > compaction->max_output_file_size() / 8
be true most of the time?
Does that mean that we limit the number of grandparents to a maximum of 2 files?
grandparent_index_moved is set to 0 in line 174, so we only skip it when the grandparent index moves twice as we add a key. The index can only move twice if the next file lies entirely between the current key and the previous key.
Will setting a small number for max_compaction_bytes give us the same effect?
@IslamAbdelRahman setting a small max_compaction_bytes will make files always be cut at the boundaries of grandparent files. This will always generate very small output files.
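To make that contrast concrete, here is a minimal sketch (not the actual RocksDB code) of the two cut criteria discussed above: with max_compaction_bytes, the output is cut as soon as the accumulated grandparent overlap exceeds the limit, no matter how small the current output file is, while the heuristic quoted above cuts only when the grandparent file being passed is meaningful in size (here, larger than 1/8 of the target output file size). All function names are illustrative.

```cpp
#include <cstdint>

// Existing rule: limit the total overlap with the grandparent level.
bool CutByMaxCompactionBytes(uint64_t overlapped_grandparent_bytes,
                             uint64_t max_compaction_bytes) {
  return overlapped_grandparent_bytes > max_compaction_bytes;
}

// PR heuristic: cut only if skipping a grandparent file big enough that
// excluding it from the future N -> N+1 compaction is worth a smaller output.
bool CutToSkipGrandparentFile(uint64_t skipped_grandparent_file_size,
                              uint64_t max_output_file_size) {
  return skipped_grandparent_file_size > max_output_file_size / 8;
}
```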
Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.
I think it's still an idea worth discussing. CC @ajkr
Great idea to exploit ShouldStopBefore()'s key-by-key consideration to cut files to minimize overlap!
Maybe add a TODO to take range tombstones into account?
@@ -171,12 +171,18 @@ struct CompactionJob::SubcompactionState {
  const std::vector<FileMetaData*>& grandparents = compaction->grandparents();
Not necessary to address for this PR - I feel like this function should operate at the granularity of atomic compaction unit rather than individual files. Particularly now that we cut to avoid overlapping a unit (currently a file), it is only beneficial if that unit also happens to be an atomic compaction unit, meaning it would not get pulled into its neighbors' compactions.
while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) >
           0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
I can't really figure out the condition. The size of the file at the initial value of grandparent_index seems like it shouldn't matter, since we've already overlapped that file and are now considering whether to overlap the next file.
How about something like this:
- a copy of the grandparent_index at the start of the function, call it prev_grandparent_index
- a flag, nontrivial_file_skipped, which indicates if we've skipped past a file of meaningful size whose index comes after prev_grandparent_index
Or we can do the first bullet point and then have a counter, file_data_skipped, which counts the cumulative size of all files that were skipped over. Then we can base the decision to cut the file on the total data that can be saved from future compaction, rather than whether there was a single file that was big enough.
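A rough sketch of that alternative follows. The names prev_grandparent_index and file_data_skipped come from the two comments above; everything else (the struct, min_bytes_saved_to_cut, and the exact loop bounds) is a hypothetical illustration, not the actual patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GrandparentFile {
  uint64_t file_size;
};

// Decide whether to cut the current output file, based on the total size of
// grandparent files skipped past since the function was entered.
bool ShouldCutForSkippedFiles(const std::vector<GrandparentFile>& grandparents,
                              std::size_t prev_grandparent_index,  // index at entry
                              std::size_t grandparent_index,       // index after advancing
                              uint64_t min_bytes_saved_to_cut) {
  uint64_t file_data_skipped = 0;
  // Accumulate the sizes of files whose index comes after
  // prev_grandparent_index and that the key range has now skipped entirely.
  for (std::size_t i = prev_grandparent_index + 1;
       i < grandparent_index && i < grandparents.size(); ++i) {
    file_data_skipped += grandparents[i].file_size;
  }
  // Cut only if the data we can keep out of the next compaction is meaningful.
  return file_data_skipped >= min_bytes_saved_to_cut;
}
```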
Can we write down what the extreme case is? I forget what we discussed last time, but the one that comes to mind now is:
Besides the possible solution we discussed before (pulling in adjacent tiny files in the output level), another possible solution is recompacting tiny files in a later compaction. It might work to first prioritize trivially moving files as far down as possible, then have a new type of "file size" compaction that compacts groups of adjacent small files within the same level (e.g., smaller than target size / 2). If it's in a non-bottommost level, there should be an additional restriction that such a compaction does not include files that overlap multiple files in the parent level. This "file size" compaction could have a similar priority to marked files, intra-bottommost, etc. (i.e., score 1.0). The downside is that the tiny-file count can grow continuously during a write burst with a particular pattern. edit: This solution could also help the problem described in #8306.
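As a hypothetical sketch of that "file size" compaction picking idea: within one level, collect runs of adjacent files smaller than half the target file size, skipping (in non-bottommost levels) any file that overlaps more than one file in the parent level. The types, names, and the run-length threshold below are illustrative assumptions, not part of RocksDB's compaction picker.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct LevelFile {
  uint64_t file_size;
  int parent_overlap_count;  // number of files it overlaps in the parent level
};

// Return [begin, end) index ranges of adjacent tiny files worth recompacting.
std::vector<std::pair<std::size_t, std::size_t>> PickSmallFileRuns(
    const std::vector<LevelFile>& level_files, uint64_t target_file_size,
    bool bottommost_level) {
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  auto eligible = [&](const LevelFile& f) {
    // A candidate is tiny and, in non-bottommost levels, does not overlap
    // more than one file in the parent level.
    return f.file_size < target_file_size / 2 &&
           (bottommost_level || f.parent_overlap_count <= 1);
  };
  std::size_t begin = 0;
  for (std::size_t i = 0; i <= level_files.size(); ++i) {
    if (i == level_files.size() || !eligible(level_files[i])) {
      if (i - begin >= 2) {  // only worth compacting a run of 2+ tiny files
        runs.emplace_back(begin, i);
      }
      begin = i + 1;
    }
  }
  return runs;
}
```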
Summary:
Try to align the compaction output file boundaries to the next level's (grandparent level) boundaries, to reduce level compaction write-amplification.

In level compaction, there is "wasted" data at the beginning and end of the output level files. Aligning the file boundaries can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to the next level's boundaries. It may cut a file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).

As a side effect, with this PR the SST file size can exceed target_file_size, but is capped at 2x target_file_size, and there will be smaller files. Here are file size statistics when loading 100GB with a target file size of 32MB:
```
      baseline      this_PR
count 1.656000e+03  1.705000e+03
mean  3.116062e+07  3.028076e+07
std   7.145242e+06  8.046139e+06
```
The feature is enabled by default; to revert to the old behavior, disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.

Also includes #1963 to cut files before a skippable grandparent file, which is for use cases like a user adding 2 or more non-overlapping data ranges at the same time; it can reduce the overlapping of the 2 datasets in the lower levels.

Pull Request resolved: #10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
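For reference, a minimal sketch of the cut rule that summary describes: an output file may be cut at a grandparent (next level) file boundary once it has reached at least half of target_file_size, and is always cut once it reaches twice target_file_size. The function and parameter names are illustrative assumptions, not the actual #10655 code.

```cpp
#include <cstdint>

bool ShouldCutAtGrandparentBoundary(uint64_t current_output_size,
                                    uint64_t target_file_size,
                                    bool at_grandparent_boundary) {
  if (current_output_size >= 2 * target_file_size) {
    return true;  // hard cap: never grow beyond 2x the target file size
  }
  // Align to the next-level boundary only when the output is already big
  // enough (>= 50% of target) that cutting will not leave a tiny file.
  return at_grandparent_boundary && current_output_size >= target_file_size / 2;
}
```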
close?
Yeah, #10655 included this idea
Summary:
By cutting compaction output files in level N, we can sometimes prevent one file in level N+1 from overlapping with files in level N.
Consider the following case.
Level N-1: [1, 21]
Level N: [3, 23]
Level N+1: [2, 4] [11, 15] [22, 24]
(each [] represents a file and each number represents a key in the file)
If we compact the files in N-1 and N, before this commit we would generate one single file in level N:
[1, 3, 21, 23]
However, this overlaps with all three files on level N+1. With this commit, we cut the file into two:
[1, 3] [21, 23]
so that the file [11, 15] will not overlap with any file in level N. This reduces some write amplification when compacting N => N+1.
Test Plan: Add test cases in compaction_job_test to cover this case. Also add a test case to make sure the normal max_compaction_bytes limit is still satisfied.
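To make the example above concrete, here is a small self-contained toy program (not RocksDB code) that walks the output keys 1, 3, 21, 23 against the level N+1 files [2, 4], [11, 15], [22, 24]. When adding key 21, the grandparent index advances past two file boundaries, meaning the file [11, 15] lies entirely between the previous key (3) and the current key (21), so the output is cut into [1, 3] and [21, 23].

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // Level N+1 (grandparent) files as [smallest, largest] key ranges.
  const std::vector<std::pair<int, int>> grandparents = {
      {2, 4}, {11, 15}, {22, 24}};
  const std::vector<int> output_keys = {1, 3, 21, 23};

  std::size_t gp = 0;
  for (std::size_t i = 0; i < output_keys.size(); ++i) {
    std::size_t advanced = 0;
    // Advance past every grandparent file whose largest key is below this key.
    while (gp < grandparents.size() &&
           grandparents[gp].second < output_keys[i]) {
      ++gp;
      ++advanced;
    }
    // Advancing by two or more means at least one grandparent file was
    // skipped over entirely: cut the output file before this key.
    if (i > 0 && advanced >= 2) {
      std::printf("cut output before key %d to avoid overlapping a skipped "
                  "grandparent file\n",
                  output_keys[i]);
    }
  }
  return 0;
}
```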