[RFC] An Adaptive merge policy with better Write and Space amplification #10340
Comments
Would appreciate feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta and @backslasht as well, so tagging more folks for visibility. Thanks!
This is copy/paste from RocksDB? I'm assuming you mean multiple Lucene "segments" instead of SST files?
Am I tracking correctly that Lucene segments at Li are only merged with segments at Li+1 once the cumulative data size of Li reaches its upper limit?
I really like the idea of an adaptive merge policy! It falls in the category of "auto tuning" which is especially nice for serverless. The concern I have here is introducing the idea of levels. @mikemccand wrote the book on these policies so I'm curious what he thinks.
Thanks @RS146BIJAY |
Nice, looks promising for workloads where update/delete is very frequent. Thanks @RS146BIJAY. Also, it looks like the primary motive of the proposal is to reclaim deleted documents as much as we can through tuning merges.
Thanks @RS146BIJAY for the proposal. It would be great if you could share the gains/improvements that you are observing when using the
The results below compare how obsolete entries degrade search performance, contrasting search latency with no obsolete entries against search latency with 20% obsolete entries (the maximum OpenSearch can accumulate). We will use these numbers as the baseline for evaluating our level merge policy implementation. The idea is that the new merge policy can merge aggressively to keep obsolete entries low when search traffic is high, so the maximum gain we can get from the new merge policy is bounded by the performance degradation the current merge policy suffers due to obsolete entries. We used
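For context on how the obsolete-entry ratio used in these comparisons can be measured, here is a minimal sketch that reads it straight from a Lucene index directory; the path argument is a placeholder and this class is not part of the benchmark setup:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class ObsoleteEntryRatio {
    public static void main(String[] args) throws Exception {
        // args[0] is a path to a Lucene index (e.g. a shard's index directory) -- placeholder.
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            // maxDoc() counts live + deleted (obsolete) documents; numDeletedDocs() counts only deleted ones.
            int deleted = reader.numDeletedDocs();
            double ratio = (double) deleted / reader.maxDoc();
            System.out.printf("deleted=%d maxDoc=%d obsoleteRatio=%.2f%%%n",
                    deleted, reader.maxDoc(), 100.0 * ratio);
        }
    }
}
```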
Are you proposing to dynamically close and re-open the IndexWriter based on search load? Or to replace OpenSearch's default MergePolicy with an adaptive one that delegates to TieredMergePolicy or LogByteSizeMergePolicy and auto-tunes those policy settings based on search load? How do you plan to reproducibly benchmark those real-world dynamics? Maybe there's an opportunity to add some auto-tuning logic to

Also curious about answers to some of the previous questions re: TieredMergePolicy's greedy design being at odds with the proposed level groups, thus preventing large segments from merging with smaller ones (please correct me if I'm tracking the proposal incorrectly). Curious to get some other folks looking at this as well so they can track and weigh in and ask additional questions. /cc @rishabh6788 @msfroh @andrross @mch2 @dreamer-89
Tagging the correct Rishabh @rishabhmaurya :) |
Thank you! I think I made that mistake before 🤦🏻♂️ SryAboutThat! |
@RS146BIJAY Thanks for looking into ideas to improve merge policies. In your proposal a lot of terminology is copied from the RocksDB paper that doesn't translate directly to Lucene very well. RocksDB is a key-value store, and all the query types you have mentioned

The main difference I see between the two approaches is that the RocksDB level merge policy merges a segment from a lower level into the segments of the next level; another way to put it is merging relatively small segment(s) with larger ones at the next level. I agree that this would clean up a lot of obsolete docs compared to the tiered merge policy (which always considers segments at the same level and of similar sizes to merge, for lower write amplification), but it trades off write amplification and cascading merges of segments to lower levels (the tiered merge policy is pretty good at this as it works on a budget model). Also, TieredMergePolicy already tries to address the problem by choosing segments where the reclaim from deleted documents is high.

As mentioned earlier by @gashutos, I would work backward from the problem we are addressing: if it's update/read heavy, I would start by customizing the existing TieredMergePolicy to reclaim deleted docs more aggressively (if it isn't doing so already, or if it isn't possible by tweaking one of the existing settings) and try to address the concern that way instead of introducing a whole new merge policy.

I agree with @nknize that the idea of an adaptive merge policy (or adaptive merge policy settings) depending on workload is quite fascinating. I would love to hear the answer to the question Nick raised on how auto-tuning will translate internally.

PS: I think the two discussions could be separated out: 1. Use of a level merge policy / optimizing the merge policy for read/update-heavy use cases; 2. Use of an adaptive merge policy.
I did not get this entirely. Did you mean dynamically closing and re-opening the IndexReader, which happens during a reader refresh? Or is this about changing the underlying merge policy and closing and reopening the IndexWriter?
The proposal is to have a single new adaptive merge policy. By adaptive I mean a single policy that can behave as both a tiered and a leveled merge policy (per the workload) by controlling the frequency of merging at the initial L-1 levels and at the last level L separately. We control the number of segments in the initial L-1 levels with a parameter K and at the last level with a parameter Z. This Fluid merge policy will initially use the lazy leveling configuration, with more segments at the initial levels and only a single segment at the last level. Now suppose read traffic is high: it will merge more aggressively at the initial L-1 levels, making it behave like a leveled merge policy. In case write traffic goes high, it can reduce the number of merges at the last level, making it behave like a tiered merge policy. PS: I am describing K and Z in terms of segments here; ideally it should be runs (a run is basically a collection of segments). Also, the tiered merge policy mentioned here is slightly different from the one Lucene is using, so I used the term
I think the confusion here is that by adaptive we mean switching merge policies based on load. This is not the case. An adaptive merge policy means the same new merge policy will behave like tiered, leveled, or lazy leveling as required.
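To make the K/Z idea above concrete, here is a minimal, hypothetical sketch of how the two bounds could be nudged from an observed load signal; the class and method names are illustrative only and are not part of the proposal:

```java
/**
 * Hypothetical sketch of the K/Z tuning described above; none of these names
 * come from the proposal. K bounds runs at levels 1..L-1, Z bounds runs at level L.
 */
public class FluidBounds {
    private final int sizeRatioT;
    private int k; // max runs per smaller level
    private int z; // max runs at the largest level

    public FluidBounds(int sizeRatioT) {
        this.sizeRatioT = sizeRatioT;
        // Start as lazy leveling: tiering at smaller levels, a single run at the last level.
        this.k = sizeRatioT - 1;
        this.z = 1;
    }

    /** Read-heavy: merge more aggressively at the smaller levels (move toward leveling). */
    public void onHighSearchLoad() {
        k = Math.max(1, k - 1);
    }

    /** Write-heavy: merge less often at the largest level (move toward tiering). */
    public void onHighIndexingLoad() {
        z = Math.min(sizeRatioT - 1, z + 1);
    }

    public int maxRunsPerSmallerLevel() { return k; }
    public int maxRunsAtLargestLevel()  { return z; }
}
```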
For the above benchmark results, where we updated about 20% of the documents, we saw this issue. The 5 GB limit on the max segment size was causing larger segments (around 5 GB in size) to carry more obsolete entries (around 60% obsolete entries in a single segment), which started impacting search performance. Essentially, larger segments are merged less frequently or not merged at all, causing search performance degradation. PS: To update 20% of the entries, we ran an update-by-query command on
Merge policies and schedulers are set on the IndexWriter, which is configured through the OpenSearch Engine. I thought you were suggesting an adaptive merge scheduler approach that would use a new LevelMergePolicy, similar to OpenSearchTieredMergePolicy, that groups segments, delegates, and tunes the settings of the optimal MergePolicy based on search AND indexing workload. In this case, I agree w/ @rishabhmaurya that these conversations can be teased into separate discussion topics. It seems this proposal is mostly focused on reclaiming deletes, which has seen quite a bit of improvement over the last few years. Have a look at the simple LUCENE-10574 PR, for example. It would be good to open this discussion upstream and propose this for Lucene in general to get feedback from other committers. You can also prototype and unit test your proposal more quickly using
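For illustration only, a minimal sketch of the "delegate and tune" idea described above; this class is hypothetical (not part of the proposal or of Lucene/OpenSearch), and the load signal is assumed to come from an external monitor:

```java
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

/**
 * Hypothetical delegating policy: picks a delegate per merge round based on
 * an externally supplied load signal, instead of switching policies on the writer.
 */
public class LoadAwareMergePolicy extends MergePolicy {
    private final TieredMergePolicy writeOptimized = new TieredMergePolicy();
    private final LogByteSizeMergePolicy readOptimized = new LogByteSizeMergePolicy();
    private volatile boolean searchHeavy; // updated by some external load monitor

    public void setSearchHeavy(boolean searchHeavy) { this.searchHeavy = searchHeavy; }

    private MergePolicy delegate() { return searchHeavy ? readOptimized : writeOptimized; }

    @Override
    public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos infos,
                                         MergeContext mergeContext) throws IOException {
        return delegate().findMerges(mergeTrigger, infos, mergeContext);
    }

    @Override
    public MergeSpecification findForcedMerges(SegmentInfos infos, int maxSegmentCount,
                                               Map<SegmentCommitInfo, Boolean> segmentsToMerge,
                                               MergeContext mergeContext) throws IOException {
        return delegate().findForcedMerges(infos, maxSegmentCount, segmentsToMerge, mergeContext);
    }

    @Override
    public MergeSpecification findForcedDeletesMerges(SegmentInfos infos,
                                                      MergeContext mergeContext) throws IOException {
        return delegate().findForcedDeletesMerges(infos, mergeContext);
    }
}
```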
Just to clarify here, the DEFAULT_DELETE_PCT_ALLOWED is 20%, which corresponds to the Lucene default.
Ack. Will try to separate the discussion of a workload-based adaptive merge policy from the Fluid LSM merge policy that we suggested in this proposal.
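For comparison with the proposal, the existing TieredMergePolicy already exposes the knobs discussed here; a minimal sketch of adjusting them on an IndexWriterConfig (the values are illustrative only, and the accepted range of deletesPctAllowed depends on the Lucene version):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class TieredKnobs {
    public static IndexWriterConfig configure(IndexWriterConfig config) {
        TieredMergePolicy tmp = new TieredMergePolicy();
        // Raise the 5 GB ceiling so large segments carrying many deletes stay merge-eligible.
        tmp.setMaxMergedSegmentMB(10 * 1024);   // illustrative: 10 GB
        // Percentage of deletes tolerated before merges are favored to reclaim them.
        tmp.setDeletesPctAllowed(20.0);         // the Lucene default; lower values reclaim deletes more aggressively
        tmp.setSegmentsPerTier(10.0);
        return config.setMergePolicy(tmp);
    }
}
```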
@RS146BIJAY These are skewed results because, given the way we updated documents, deleted documents are concentrated in a few segments (a couple of segments with 60% deleted documents while the rest of the segments have no deleted entries). I think we need to see how search performance behaves with different percentages of deleted documents.
@nknize If a segment is already 5 GB, I think in that case,
That just avoids going over 5GB (or the configured max segment size) when using In other words, you can use |
@gashutos I believe |
We should try to tune this limit (in the new implementation) at the segment level.
Problem
Currently OpenSearch uses an implementation of the tiered merge policy for compaction. The tiered merge policy is suited for write-intensive workloads as it has lower write amplification. Its disadvantage is that read and space amplification are high. Merging a set of segments requires about 3 times the total size of the segments being merged (since compound files are enabled), increasing space amplification. Also, obsolete entries (created by updates or deletions) are removed from segments more slowly under the tiered merge policy (compared to a leveled merge policy), increasing read amplification.
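To make the space overhead concrete (a back-of-the-envelope reading of the 3x figure above, assuming the merged output is roughly the size of its inputs): while a merge of segments totalling S bytes is in flight, the inputs, the merged output, and its compound-file copy can coexist on disk.

```math
\text{peak transient disk} \;\approx\; \underbrace{S}_{\text{input segments}} + \underbrace{S}_{\text{merged output}} + \underbrace{S}_{\text{compound-file copy}} \;=\; 3S
```

For example, merging 5 GB worth of segments can transiently require on the order of 15 GB of free disk.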
What are the other alternatives?
Leveled Merge Policy
The leveled merge policy is used for read-intensive workloads. In leveled compaction, data on disk is likewise organized into multiple levels. The size of each level is maintained at T times that of the previous level (generally T = 10). A special level-0 (or L0 for short) contains files just flushed from the in-memory write buffer (memtable). All non-zero levels have target sizes, which usually increase exponentially. Inside each level (except level 0), data is range-partitioned into multiple SST files.
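A worked example of the sizing above (the numbers are illustrative, with T = 10): if L1 targets 1 GB, then L2 targets 10 GB and L3 targets 100 GB, so the number of levels grows only logarithmically with the total data size.

```math
\text{size}(L_i) = T \cdot \text{size}(L_{i-1}), \qquad \text{number of levels} \;\approx\; \log_T\!\left(\frac{\text{total data size}}{\text{size}(L_1)}\right)
```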
A segment merge is triggered when the data size of a level reaches its upper limit. In that case, segments at the current level are pushed to and merged with the segments of the next level, and the resulting segments are placed in the next level. Since merging at the current level happens every time the previous level fills up and a new set of segments is pushed to this level, read amplification (because obsolete entries remain low) and space amplification (because only a small set of segments is pushed and merged from the previous level) are lower than with the tiered merge policy.
The issue with this merge policy is that write amplification is very high, so it is suited mainly for read-intensive workloads.
Leveled Merge Policy with lazy leveling
The leveled merge policy with lazy leveling improves write amplification over plain leveling by eliminating merging at all but the largest level. Lazy leveling at its core is a hybrid of leveling and tiering: it applies leveling at the largest level and tiering at all other levels.
In order to keep the cost complexity of point lookups fixed despite having more segments to probe at the smaller levels, we keep the Bloom filter false positive rate (FPR) at the lower levels better (lower) than the FPR at the larger levels. Therefore, relative to leveling, lazy leveling (1) improves the cost complexity of updates, (2) maintains the same complexity for point lookups, long range lookups, and space amplification, and (3) provides competitive performance for short range lookups.
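To spell out the reasoning behind this FPR allocation (a sketch of the argument, with p_i denoting the Bloom filter FPR and r_i the number of runs at level i): a zero-result point lookup probes the filter of every run, so its expected wasted I/O is the sum of the per-run false positive rates.

```math
\mathbb{E}[\text{wasted I/Os per zero-result point lookup}] \;=\; \sum_{i=1}^{L} r_i \, p_i
```

Since tiering leaves more runs at the smaller levels, giving those levels lower p_i keeps this sum, and hence the point lookup cost, unchanged.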
Worst case cost comparisons between merge policies for different operations
where,
L = number of levels
T = size ratio between adjacent levels
M = main memory allocated to the Bloom filters
N = total number of entries
s = selectivity of a range lookup
B = number of entries that fit into a storage block
P = size of the buffer in storage blocks
Analysis summary
As can be inferred from the above analysis, no single merge policy dominates the others universally. Lazy leveling is best for combined workloads consisting of updates, point lookups, and long range lookups, whereas tiering and leveling are best for workloads comprising mostly updates or mostly lookups, respectively.
Proposal - Fluid LSM merge policy
Fluid LSM is an adaptive merge policy that generalizes all of the above merge policies (tiered, leveled, and lazy leveling). It enables switching between different merge policies according to the workload. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels.
In a Fluid LSM-tree, there are at most Z runs at the largest level and at most K runs at each of the smaller levels. To maintain these bounds, every level i has an active run into which we merge incoming runs from level i − 1. Each active run has a size threshold with respect to the capacity of its level (a level's capacity is T times that of the previous level): T/K percent for levels 1 to L − 1 and T/Z percent for level L. When an active run reaches this threshold, we start a new active run at that level. Ultimately, when a level is at capacity, all runs in it get merged and flushed down to the next level.
The bounds K and Z are used as tuning parameters that enable Fluid LSM-tree to assume the behaviors of different merge policies.
The Fluid LSM-tree can transition from lazy leveling to tiering by increasing Z (merging less frequently at the largest level), or to leveling by decreasing K (merging more frequently at all other levels).
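For reference, the (K, Z) settings that recover each policy, as described in the referenced paper, can be captured in a small sketch; the class and method names below are illustrative only:

```java
/**
 * Illustrative only (these names are not from the proposal): the (K, Z) bounds
 * that make the Fluid LSM design behave like each classic policy.
 * K bounds runs at levels 1..L-1, Z bounds runs at the largest level L.
 */
public final class FluidLsmPresets {
    public static final class Bounds {
        public final int k; // max runs at each of levels 1..L-1
        public final int z; // max runs at the largest level L
        Bounds(int k, int z) { this.k = k; this.z = z; }
    }

    /** One run per level everywhere: classic leveling. */
    public static Bounds leveling(int t)     { return new Bounds(1, 1); }

    /** Up to T-1 runs at every level: classic tiering. */
    public static Bounds tiering(int t)      { return new Bounds(t - 1, t - 1); }

    /** Tiering at levels 1..L-1, a single run at level L: lazy leveling. */
    public static Bounds lazyLeveling(int t) { return new Bounds(t - 1, 1); }
}
```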
Further steps
Reference Paper
https://scholar.harvard.edu/files/stratos/files/dostoevskykv.pdf