Staging compaction 1st draft #7442
Conversation
This is beautiful and fits in very nicely with the existing code in Graveler. Yes, please!
We might add a section explicitly stating the advantages by counting the operations needed with and without compaction.
> There might be other cases where the structure of the `Branch` entity impacts
> the performance of reading from the staging area, for example when the
> number of sealed tokens is large (N)[^1] and reading a missing entry requires
> reading from all N sealed tokens.
Also I'm putting together a separate proposal for partitioning the staging area. It is much simpler and more efficient to partition only forwards in time and not change anything already written. So we will want to compact staging as a way to reduce read load on already-written parts, by removing KV partitions entirely!
I'd love to read that proposal. I'm having a hard time understanding how compacting partitions works 😅
> based on the number of sealed tokens, although not necessarily a priority
> for the first version. We can probably avoid the need to query the kv-store
> to retrieve that information (except for the service startups) by caching
> the information in the Sensor.
It is hard to cache the number of deletions locally in a cluster of lakeFSes. One option might be for each lakeFS instance to select a unique ID and store its deletions count in the KV store under some key. Writes are immediately uncontended, and write load can be reduced by batching the increments for a short while.
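A minimal sketch of that option, assuming a stripped-down KV interface; the names (`deletionCounter`, `kvStore`, the key layout, the flush interval) are placeholders for illustration and not part of the actual lakeFS codebase:

```go
package compaction

import (
	"context"
	"encoding/binary"
	"sync"
	"time"
)

// kvStore is a minimal stand-in for the real KV interface; only Set is needed here.
type kvStore interface {
	Set(ctx context.Context, key, value []byte) error
}

// deletionCounter batches deletion counts locally and periodically writes the
// running total to a key owned by this lakeFS instance, so concurrent
// instances never contend on writes.
type deletionCounter struct {
	mu       sync.Mutex
	pending  uint64
	store    kvStore
	key      []byte        // e.g. "compaction/deletions/<branch>/<instance-uuid>"
	interval time.Duration // batching window, e.g. 500 * time.Millisecond
}

func (c *deletionCounter) Inc() {
	c.mu.Lock()
	c.pending++
	c.mu.Unlock()
}

// run flushes the running total every interval; a reader (the Sensor) sums the
// keys of all instances to estimate deletions on the branch.
func (c *deletionCounter) run(ctx context.Context) {
	ticker := time.NewTicker(c.interval)
	defer ticker.Stop()
	var flushed uint64
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.mu.Lock()
			n := c.pending
			c.mu.Unlock()
			if n == flushed {
				continue // nothing new to write
			}
			buf := make([]byte, 8)
			binary.BigEndian.PutUint64(buf, n)
			if err := c.store.Set(ctx, c.key, buf); err == nil {
				flushed = n
			}
		}
	}
}
```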
> There might be other cases where the structure of the `Branch` entity impacts
> the performance of reading from the staging area, for example when the
> number of sealed tokens is large (N)[^1] and reading a missing entry requires
> reading from all N sealed tokens.
As far as I know, the multiple sealed tokens case is rare.
Is there an open performance issue where we found more than one token causing a performance problem?
Nope. Well, I did see this when experimenting with very many concurrent merges. But there are other reasons why those are slow. Still, once I get to split staging tokens this might become more common.
Definitely something for later; I agree with @itaiad200 that initially at least we should trigger this due to multiple deletions -- it will definitely help there immediately.
It's indeed rare, as every sealed token represents a failed operation on the branch, so I don't think it has a big impact.
> - Preserve lakeFS consistency guarantees for all scenarios.
> - A more efficient way to read from branches with a large number of
>   tombstones (i.e. Fix the issue mentioned above).
> - Preserve lakeFS performance for other scenarios.[^2]
I think that "other scenarios" can also include accessing the blockstore. Iterating over a metarange can cause slowdowns when we need to download blocks (unless they are cached). The current iterator uses Next for both uncommitted and committed data, so I assume we have the same tombstone issue when scanning committed data: if the staging/uncommitted entries are full of changes too, we "pay" in performance for scanning data that will not be served but is still scanned.
I don't think that's true. The `CompactedMetaRange` will not contain just the uncommitted entries. It will contain a snapshot of the branch at the time of the compaction, i.e. as if a commit ran at that time. Meaning, tombstones will not be part of the `CompactedMetaRange`, as we do not store deletions in a commit. So we can't have the same tombstone issue.
However, you're pointing at a different potential problem: we're replacing load on the kv-store with load on the object store.
design/open/staging-compaction.md (Outdated)
> ### Commit Flow
>
> The commit flow is slightly affected by the compaction process. If
> compaction never happened, the commit flow is the same as today. If a
> compaction happened, apply the changes to the compacted metarange instead of
> the HEAD commit metarange.
I guess we will need to address the "empty commit" case differently when there is a compaction metarange.
Why are you calling out `empty commit` specifically? The view of the staging area remains the same: you look at the staging token + sealed tokens + compaction (if available) and see if they present any changes to the commit.
Wait... we are overloading the word "commit" here, and I think that it leads to some confusion. AFAICT we just need to set `AllowEmpty`.
Let me see if I understand how compaction works. I think we are talking about a scenario like this one: I add many (100_000) objects to staging, and delete them without committing.
Now I have a staging area with 100_000 tombstones[^1]. A few seconds later I want to compact. This is not a whole commit! We generate a new metarange using committed.Commit -- this is below Graveler. We tell it to AllowEmpty. It runs over the previous base metarange (branch.CompactedMetaRange or, if we don't have one, the metarange of branch.CommitID) and applies all the tombstones. Nothing happens, so it generates the exact same ranges and metaranges as before. When we close the MetaRangeWriter we get the exact same metarange ID. Compaction only deletes the staging token and the sealed tokens; the base metarange ID stays.
So AFAICS this is the same flow with no changes. The "no empty commits" rule just means we need to set AllowEmpty here.
Footnotes

- Reminder: we cannot delete the key from the staging area, because the original uncommitted write might have overwritten a previous object. We never scan backwards, and doing so on deletions would make them expensive. ↩
- Return of the (terribly named) partial commit! Not to be confused, of course, with partial commit. ↩
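Under the reading above, a minimal Go-style sketch of that flow might look like the following. All type and function names here (`committedCommitter`, `compact`, `metaRangeOf`, the `Branch` fields) are illustrative assumptions mirroring the scenario, not the actual Graveler APIs:

```go
package compaction

import "context"

// Illustrative types only; the real Graveler types differ.
type (
	MetaRangeID   string
	CommitID      string
	StagingToken  string
	ValueIterator interface{} // stands in for an iterator over staged changes (incl. tombstones)
)

type Branch struct {
	CommitID             CommitID
	CompactedMetaRangeID MetaRangeID
	StagingToken         StagingToken
	SealedTokens         []StagingToken
}

// committedCommitter mirrors the role of committed.Commit in the scenario above:
// apply changes on a base metarange, allowing an empty (no-op) result.
type committedCommitter interface {
	Commit(ctx context.Context, base MetaRangeID, changes ValueIterator, allowEmpty bool) (MetaRangeID, error)
}

// compact applies the staged changes onto the base metarange. When every change
// cancels out -- e.g. 100k adds followed by 100k deletes -- the returned ID
// equals the base, and only the tokens are dropped.
func compact(ctx context.Context, b *Branch, cm committedCommitter, changes ValueIterator, metaRangeOf func(CommitID) MetaRangeID) error {
	base := b.CompactedMetaRangeID
	if base == "" {
		base = metaRangeOf(b.CommitID) // no previous compaction: start from HEAD's metarange
	}
	newID, err := cm.Commit(ctx, base, changes, true /* AllowEmpty */)
	if err != nil {
		return err
	}
	// In the real flow this update would happen atomically together with swapping
	// the staging token and clearing the sealed tokens.
	b.CompactedMetaRangeID = newID
	b.SealedTokens = nil
	return nil
}
```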
Just replying to @nopcoder: I think this design is a small, compact, and useful change that will help resolve a current issue users have.
Loved the design, very clear and convincing.
I commented only on three areas where I'm missing some additional information.
Thanks!!!
> information on writes to the staging area. It will decide when to compact a
> certain branch based on the number of deleted entries to its staging area,
Just a thought regarding "based on the number of deleted entries":
Do we have any statistics on cases that do bulk deletes? IIRC there are Spark jobs that do a bunch of deletes; working by a fixed number of deleted entries may result in many compactions during a bulk delete.
Another example I can think of is moving a directory that exists on staging: doing "commits" along the way may result in commits, each of which is heavier than the single commit that would've been done otherwise.
I don't have any statistics.
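For concreteness, the fixed-threshold heuristic being questioned here might look roughly like the sketch below; the `Sensor` type, the threshold value, and the trigger callback are assumptions for illustration only, not the proposed implementation:

```go
package compaction

import "sync"

// Sensor is a hypothetical fixed-threshold trigger: it counts deletions per
// branch and asks for compaction once the count crosses the threshold. With a
// threshold of 1000, a bulk delete of 100k keys would fire ~100 triggers unless
// the Compactor coalesces or debounces them -- the concern raised above.
type Sensor struct {
	mu        sync.Mutex
	deletions map[string]int      // branch -> deletions since last trigger
	threshold int                 // e.g. 1000
	trigger   func(branch string) // e.g. enqueue the branch for the Compactor
}

func (s *Sensor) RecordDelete(branch string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.deletions[branch]++
	if s.deletions[branch] >= s.threshold {
		s.deletions[branch] = 0
		go s.trigger(branch) // fire-and-forget; duplicates coalesced downstream
	}
}
```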
> ### Compactor
>
> Upon deciding to compact, the Sensor will trigger the Compactor. The
> Compactor will perform an operation very similar to Commit. Before starting
> to compact, it will atomically:
What about the "garbage"? This compactor may create many "dangling" ranges. Do we have any estimate of how this would affect things, or whether it might be an issue?
It will increase the number of dangling commits significantly, let's say by an order of magnitude? We'll need to handle that at some point in some GC process.
> 1. Read from the staging token.
> 2. Read from the sealed tokens (in order).
> 3. If a CompactedMetaRangeID exists, read the 2-way diff between the compacted
>    metarange and the CommitID's metarange.
There is still a step I think I'm missing: what do I do with the data from the sealed (and staging) tokens and the data from the diff? There is some kind of "uncommitted merge" that we need to do.
+1 on that
This already exists today. Assume the following simple branch scenario today:
StagingToken: {A:{foo1:barA}}
SealedTokens: [{B:{foo1:barB,foo2:gooB}}, {C:{foo1:barC,foo2:gooC}}]
Commit: [{foo1:barCommitted,foo2:gooCommitted,foo3:jooCommitted}]
Reading foo1, foo2, foo3 from this branch yields barA, gooB, jooCommitted. For reading from the uncommitted area, you just skip the last layer, i.e. reading the commit.
With compaction the same algorithm holds, but you need to diff the CompactedMetaRange with the commit for the changes that are already manifested in the compaction.
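A minimal sketch of that layered lookup, with plain maps standing in for the staging token, sealed tokens, and committed/compacted data; `getFromBranch` and the nil-value tombstone convention are illustrative, not the real Graveler code:

```go
package main

import "fmt"

// Each layer maps key -> value; a nil value is a tombstone. Layers are ordered
// newest first: staging token, then sealed tokens, then compacted/committed data.
func getFromBranch(layers []map[string][]byte, key string) ([]byte, bool) {
	for _, layer := range layers {
		if v, ok := layer[key]; ok {
			if v == nil {
				return nil, false // tombstone: deleted in a newer layer
			}
			return v, true
		}
	}
	return nil, false // not found in any layer
}

func main() {
	layers := []map[string][]byte{
		{"foo1": []byte("barA")},                         // staging token A
		{"foo1": []byte("barB"), "foo2": []byte("gooB")}, // sealed token B
		{"foo1": []byte("barC"), "foo2": []byte("gooC")}, // sealed token C
		{"foo1": []byte("barCommitted"), "foo2": []byte("gooCommitted"), "foo3": []byte("jooCommitted")}, // commit
	}
	for _, k := range []string{"foo1", "foo2", "foo3"} {
		v, _ := getFromBranch(layers, k)
		fmt.Printf("%s -> %s\n", k, v) // foo1 -> barA, foo2 -> gooB, foo3 -> jooCommitted
	}
}
```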
> number of sealed tokens is large (N)[^1] and reading a missing entry requires
> reading from all N sealed tokens.
>
> ## Goals
I don't see any reference to visibility into this internal feature.
To me this is parallel to having a GC: I want to know if it worked, when it runs, how much time it took, etc.
If you agree, can you add a small section about the visibility requirements?
As a user, why do you want to know if compaction worked? And how effective was it? You only care about it when performance degrades and you need to drill down further. I guess most users will never hear of that.
GC is different. It's an external process I need to manage by myself (or pay someone to do it) and it results in storage costs.
We should have metrics for it just like with any other flow in Graveler.
First, sorry for not being clear - I meant GC in the JVM, not lakeFS.
And this is super internal, but we must have some visibility into the performance of compaction in isolation from the rest.
The reason is that the Sensor will evolve over time in different directions with heuristics / tradeoffs / configuration.
Having visibility is important.
I think this should be mentioned as part of the design; otherwise it can easily be missed and "shipped later" or done as an afterthought.
I think Prometheus metrics are more than enough for the sophisticated lakeFS user to tweak what needs to be tweaked. What new metrics would you like me to call out in this design?
The number of successful compaction runs, the number of compaction failures, or the total number of compaction runs.
So a single event, `compaction_run{success: bool}`?
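If that's the shape, a plain Prometheus counter with a success label would cover all three of the numbers mentioned above. A hedged sketch using the standard Go client; the metric name and label are placeholders, not metrics that already exist in lakeFS:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// compactionRuns counts compaction attempts, split by outcome. Successes,
// failures, and totals all fall out of this single series via queries.
var compactionRuns = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "graveler_compaction_runs_total",
	Help: "Total number of staging compaction runs, labeled by success.",
}, []string{"success"})

// reportCompaction records one compaction run and its outcome.
func reportCompaction(err error) {
	compactionRuns.WithLabelValues(strconv.FormatBool(err == nil)).Inc()
}
```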
> ### Commit Flow
>
> The commit flow is slightly affected by the compaction process. If
> compaction never happened, the commit flow is the same as today. If a
> compaction happened, apply the changes to the compacted metarange instead of
> the HEAD commit metarange.
What if a compaction happens during a commit? Will it fail the commit? If that is the case, are we fine with compaction failing commits?
Updated.
Proposal for #2092