
Compaction to cut output file if it can avoid overlapping a grandparent file #1963

Closed

Conversation

@siying (Contributor) commented Mar 8, 2017

Summary:
By cutting compaction output files in level N, we can sometimes prevent a file in level N+1 from overlapping with files in level N.
Consider the following case:
Level N-1: [1, 21]
Level N: [3, 23]
Level N+1: [2, 4] [11, 15] [22, 24]
(each [] represents a file and each number represents a key in the file)
If we compact the files in N-1 and N, then before this change we would generate one single file in level N:
[1, 3, 21, 23]
However, this overlaps with all three files on level N+1. With this change, we cut the file into two:
[1, 3] [21, 23]
so that the file [11, 15] no longer overlaps with any file in level N. This reduces write amplification when compacting N => N+1.

Test Plan: Add test cases in compaction_job_test to cover this case. Also add a test case to make sure the normal max_compaction_bytes limit is still satisfied.
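
To make the cut decision concrete, here is a minimal, hypothetical C++ sketch of the idea (not the PR's actual code); the FileRange struct, ShouldCutBeforeKey, and plain byte-wise string comparison standing in for the internal key comparator are all assumptions made for illustration:

```
// Hypothetical sketch, not the RocksDB implementation: decide whether to cut
// the current output file before adding `key`, because one or more grandparent
// files fall entirely between the previously written key and `key`. Cutting
// here keeps those grandparent files out of the next N => N+1 compaction.
#include <string>
#include <vector>

struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns true if some grandparent file lies strictly inside the gap between
// `prev_key` (last key written to the current output) and `key` (next key).
bool ShouldCutBeforeKey(const std::string& prev_key, const std::string& key,
                        const std::vector<FileRange>& grandparents) {
  for (const FileRange& f : grandparents) {
    if (prev_key < f.smallest && f.largest < key) {
      return true;  // the whole grandparent file can be skipped by cutting here
    }
  }
  return false;
}
```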

@facebook-github-bot (Contributor)

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@siying (Contributor Author) commented Mar 8, 2017

@yoshinorim I'm hoping this can help reduce write amplification in the use case you are working on.

@siying (Contributor Author) commented Mar 8, 2017

Here is a benchmark scenario that is friendly to this change: time-series inserts with 10 time series.

Before: compaction writes 9.89 GB, flush writes 0.612 GB
After: compaction writes 3.07 GB, flush writes 0.612 GB

The command:

./db_bench --benchmarks=timeseries,stats --num=10000000 --key_id_range=10 --write_buffer_size=2000000 --target_file_size_base=2000000 --threads=1 -benchmark_write_rate_limit=6000000 -max_bytes_for_level_multiplier=5 -max_background_compactions=8

@facebook-github-bot (Contributor)

@siying updated the pull request.

@yoshinorim (Contributor) commented Mar 14, 2017

@siying After running db_bench, did the number of sst files change noticeably with your diff (i.e., were many smaller sst files created)?

@siying (Contributor Author) commented Mar 14, 2017

@yoshinorim for this benchmark, it didn't. I am worried about some extreme cases where many small files may be created, and I am thinking of ways to mitigate the risk.

while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) > 0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
        grandparents[grandparent_index]->fd.GetFileSize() >
            compaction->max_output_file_size() / 8) {
Review comment (Contributor):

Maybe I am missing something, but shouldn't this condition
grandparents[grandparent_index]->fd.GetFileSize() > compaction->max_output_file_size() / 8
be true most of the time?

Does that mean we limit the number of grandparents to a maximum of 2 files?

Reply (Contributor Author):

grandparent_index_moved is set to 0 in line 174, so we only skip it if the grandparent index moves twice while we add a single key. We can only move the index twice if the next file lies entirely between the current key and the previous key.
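
As a hypothetical illustration of the "index moves twice" reasoning (this is not the PR's code; the function name, the plain string keys, and the hard-coded threshold of 2 are assumptions):

```
// Sketch: while advancing past grandparent files whose largest key is below
// the key being added, count how many files we pass. Passing two or more
// files for a single key means at least one grandparent file fits entirely
// into the gap since the previous key, so cutting the output here lets that
// grandparent file stay out of the next compaction.
#include <cstddef>
#include <string>
#include <vector>

bool MovedPastTwoOrMoreGrandparents(std::size_t* grandparent_index,
                                    const std::vector<std::string>& gp_largest,
                                    const std::string& key_to_add) {
  std::size_t moved = 0;  // analogous to grandparent_index_moved in the diff
  while (*grandparent_index < gp_largest.size() &&
         gp_largest[*grandparent_index] < key_to_add) {
    ++*grandparent_index;
    ++moved;
  }
  return moved >= 2;
}
```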

@IslamAbdelRahman (Contributor)

Will setting a small number for max_compaction_bytes give us the same effect?

@siying (Contributor Author) commented Mar 20, 2017

@IslamAbdelRahman setting a small max_compaction_bytes will make files always be cut at the boundaries of grandparent files, which will always generate very small output files.

@gfosco (Contributor) commented Jan 10, 2018

Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.

@siying (Contributor Author) commented Dec 23, 2020

I think it's still an idea worth discussing. CC @ajkr

@ajkr (Contributor) left a comment

Great idea to exploit ShouldStopBefore()'s key-by-key consideration to cut files to minimize overlap!

Maybe add a TODO to take range tombstones into account?

@@ -171,12 +171,18 @@ struct CompactionJob::SubcompactionState {
const std::vector<FileMetaData*>& grandparents = compaction->grandparents();
Review comment (Contributor):

Not necessary to address for this PR - I feel like this function should operate at the granularity of atomic compaction units rather than individual files. Particularly now that we cut to avoid overlapping a unit (currently a file), it is only beneficial if that unit also happens to be an atomic compaction unit, meaning it would not get pulled into its neighbors' compactions.

while (grandparent_index < grandparents.size() &&
       icmp->Compare(internal_key,
                     grandparents[grandparent_index]->largest.Encode()) > 0) {
  if (seen_key) {
    overlapped_bytes += grandparents[grandparent_index]->fd.GetFileSize();
    if (grandparent_index == 0 ||
@ajkr (Contributor) commented Jan 4, 2021

I can't really figure out the condition. The size of the file at the initial value of grandparent_index seems like it shouldn't matter, since we've already overlapped that file and are now considering whether to overlap the next file.

How about something like this:

  • a copy of the grandparent_index at the start of the function, call it prev_grandparent_index
  • a flag, nontrivial_file_skipped, which indicates if we've skipped past a file of meaningful size whose index comes after prev_grandparent_index

Review comment (Contributor):

Or we can do the first bullet point and then have a counter, file_data_skipped, which counts the cumulative size of all files that were skipped over. Then we can base the decision to cut the file on the total data that can be saved from future compaction, rather than whether there was a single file that was big enough.
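
A hypothetical sketch combining the prev_grandparent_index idea from the previous comment with the file_data_skipped counter suggested here (not code from the PR; the struct, function name, and threshold parameter are illustrative):

```
// Sketch: remember the grandparent index on entry, accumulate the sizes of
// grandparent files that are fully skipped while adding this key, and cut the
// output file once the skipped data exceeds a threshold, i.e. base the cut on
// how much future compaction work can be saved.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct GrandparentFile {
  std::string largest_key;
  std::uint64_t file_size;
};

bool ShouldCutForSkippedData(std::size_t* grandparent_index,
                             const std::vector<GrandparentFile>& grandparents,
                             const std::string& key_to_add,
                             std::uint64_t cut_threshold_bytes) {
  const std::size_t prev_grandparent_index = *grandparent_index;
  std::uint64_t file_data_skipped = 0;
  while (*grandparent_index < grandparents.size() &&
         grandparents[*grandparent_index].largest_key < key_to_add) {
    // Only files strictly after the one we started on are fully skipped by
    // the gap between the previous key and key_to_add.
    if (*grandparent_index > prev_grandparent_index) {
      file_data_skipped += grandparents[*grandparent_index].file_size;
    }
    ++*grandparent_index;
  }
  return file_data_skipped >= cut_threshold_bytes;
}
```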

@ajkr (Contributor) commented Oct 16, 2021

> @yoshinorim for this benchmark, it didn't. I am worried about some extreme cases where many small files may be created, and I am thinking of ways to mitigate the risk.

Can we write down what the extreme case is? I forget what we discussed last time, but the one that comes to mind now is:

  • Grandparent level has files like [aa az] [ba bz] [ca cz] ...
  • When we write keys a1, b1, c1 to the parent level, they each get their own file, so it looks like: [a1] [b1] [c1]
  • Can it get worse from here? Writing a2, b2, c2 could result in a parent level like: [a1] [a2] [b1 b2] [c1 c2].

@ajkr (Contributor) commented Oct 17, 2021

Besides the possible solution we discussed before (pulling in adjacent tiny files in the output level), another possible solution is recompacting tiny files in a later compaction. It might work to first prioritize trivially moving files as far down as possible, then have a new type of "file size" compaction that compacts groups of adjacent small files within the same level (e.g., smaller than target size / 2). If it's in a non-bottommost level, there should be an additional restriction that such a compaction does not include files that overlap multiple files in the parent level. This "file size" compaction could be similar priority to marked files, intra-bottommost, etc. (i.e., score 1.0). The downside is that the tiny-file count can grow continuously during a write burst with a particular pattern.

edit: This solution could also help the problem described in #8306.
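
A hypothetical sketch of the "file size" compaction picking step described above (not an actual RocksDB compaction picker; the struct, function name, and the group-size minimum of 2 are assumptions, and the parent-overlap restriction is omitted for brevity):

```
// Sketch: scan a level's files in key order and collect runs of adjacent
// files smaller than half the target file size; each run of two or more tiny
// files is a candidate for an intra-level "file size" compaction.
#include <cstddef>
#include <cstdint>
#include <vector>

struct LevelFile {
  std::uint64_t file_size;
};

std::vector<std::vector<std::size_t>> PickTinyFileGroups(
    const std::vector<LevelFile>& level_files, std::uint64_t target_file_size) {
  std::vector<std::vector<std::size_t>> groups;
  std::vector<std::size_t> current;
  for (std::size_t i = 0; i < level_files.size(); ++i) {
    if (level_files[i].file_size < target_file_size / 2) {
      current.push_back(i);
    } else {
      if (current.size() >= 2) groups.push_back(current);
      current.clear();
    }
  }
  if (current.size() >= 2) groups.push_back(current);
  return groups;
}
```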

facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2022
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there is "wasted" data at the beginning and end of the
output level files. Aligning the file boundaries can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to their
next-level ones. It may cut a file when the file size is large enough (at least
50% of target_file_size) and not too large (at most 2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR the SST file size can exceed
target_file_size, but it is capped at 2x target_file_size. There will also be
smaller files. Here are file size statistics when loading 100GB with a target
file size of 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default; to revert to the old behavior, disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.

Also includes #1963, which cuts a file before a skippable grandparent file. This is for
use cases like a user adding 2 or more non-overlapping data ranges at the same
time; it can reduce the overlapping of the 2 datasets in the lower levels.

Pull Request resolved: #10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
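
As a rough illustration of the 50%-to-2x size window described in the commit message above (a simplified, hypothetical sketch, not the #10655 implementation; the function name and parameters are assumptions):

```
// Sketch: only cut the output file at a grandparent boundary once it is at
// least half the target size; independently, always cut once it reaches
// 2x the target size so the cap is respected.
#include <cstdint>

bool ShouldCutOutputFile(std::uint64_t current_output_bytes,
                         std::uint64_t target_file_size,
                         bool at_grandparent_boundary) {
  if (current_output_bytes >= 2 * target_file_size) {
    return true;  // hard cap: never exceed 2x target_file_size
  }
  if (at_grandparent_boundary &&
      current_output_bytes >= target_file_size / 2) {
    return true;  // aligned cut: output is large enough (>= 50% of target)
  }
  return false;
}
```
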
@lightmark (Contributor)

close?

@ajkr (Contributor) commented Nov 1, 2022

Yeah, #10655 included this idea

@ajkr closed this Nov 1, 2022