
Proposal: Uncommitted GC #4382

Merged
merged 27 commits into from
Nov 10, 2022

Conversation

@N-o-Z (Member) commented Oct 18, 2022

Closes #1933

@N-o-Z N-o-Z added proposal exclude-changelog PR description should not be included in next release changelog labels Oct 18, 2022
@N-o-Z N-o-Z self-assigned this Oct 18, 2022
@itaiad200 (Contributor) left a comment:

I think you're on to something, great work! See my comments inline.

#### Objects Path Conventions

1. Repository objects will be stored under the prefix `<bucket_name>/repos/<repo_uid>`
2. Branch objects will be stored under the repo prefix with the path `branches/<branch_id>/`
Contributor:

What is a Branch Object and what is a Repository Object? I think we need terminology section.

Member Author:

Rephrased. Let me know if this clarifies it, or whether I should add a terminology section.


1. Repository objects will be stored under the prefix `<bucket_name>/repos/<repo_uid>`
2. Branch objects will be stored under the repo prefix with the path `branches/<branch_id>/`
3. Each lakeFS instance will create a unique prefix partition (serialized) under the branch path to
Contributor:

Will a standalone lakeFS storageNamespace structure look different than a scaled lakeFS? What if the container restarts, should it use the same unique prefix?

@N-o-Z (Member Author), Oct 18, 2022:

Effectively it will look the same.
Each lakeFS instance will maintain its own partitions, appending a unique suffix to the partition name, in the following manner:
`<sortable_descending_serialized_uid>_<lakefs_instance_uid>`
Each instance manages an exclusive partition; on container restart, the lakeFS instance will create a new partition according to the above conventions.
The current idea is to track partition size in memory. This means a restarted instance has no context on its existing partitions, effectively leaving them partially allocated. If you think this is not sufficient, we might want to think of an additional way to track the partitions.
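As an illustration, a partition name with these properties could be derived by subtracting the current timestamp from a fixed maximum, so that newer partitions sort first in a plain lexicographic listing. This is a hypothetical sketch, not the actual lakeFS serialization; the constant and function names are assumptions:

```python
import time

MAX_NANOS = 2**63 - 1  # upper bound used to invert the timestamp (assumption)

def new_partition_name(instance_uid, now_nanos=None):
    """Build `<sortable_descending_serialized_uid>_<lakefs_instance_uid>`.

    Subtracting the timestamp from a fixed maximum makes newer partitions
    sort *before* older ones lexicographically, which is what lets the
    optimized GC stop scanning once it reaches the previous run's last
    read partition.
    """
    if now_nanos is None:
        now_nanos = time.time_ns()
    descending = MAX_NANOS - now_nanos
    # Zero-pad to a fixed width so string order matches numeric order.
    return f"{descending:020d}_{instance_uid}"

# A restarted instance simply calls new_partition_name() again with a fresh
# instance uid, abandoning its old (partially filled) partition.
```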

Member Author:

Updated document

3. Each lakeFS instance will create a unique prefix partition (serialized) under the branch path to
store the branch objects.
The serialized partition prefix will allow partial scans of the bucket when running the optimized GC
4. lakeFS will track the count of objects uploaded to the prefix and create a new one every < TBD > objects uploaded
Contributor:

Is it a hard bound? Might be hard to implement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, this is an upper bound, which might not be reached in the case of an instance restart/shutdown.

Contributor:

It still requires serializing all uploads.

5. Subtract results `lakeFS DF` from `Branch DF`
6. Filter files newer than < TOKEN_EXPIRY_TIME > and special paths
7. The remainder is a list of files which can be safely removed
8. Finally, save the current run's `GC commit`, the last read partition, and the newest commit id
Contributor:

Per branch? Where?

Member Author:

Per branch - updated

2. Read addresses from branch's new commits (all commits up to the last GC run commit id) -> `lakeFS DF`
3. Read addresses from branch `GC commit` -> `lakeFS DF`
4. Subtract results `lakeFS DF` from previous run's `GC commit`
5. The result is a list of files that can be safely removed
Contributor:

Noice

##### Step 1. Analyze Data and Perform Cleanup for old entries (GC client)

1. Run PrepareUncommittedForGC
2. Read addresses from branch's new commits (all commits up to the last GC run commit id) -> `lakeFS DF`
Contributor:

Suggested change
2. Read addresses from branch's new commits (all commits up to the last GC run commit id) -> `lakeFS DF`
2. Read addresses from branch's new commits (all commits from to the last GC run commit id) -> `lakeFS DF`

Member Author:

Since log commits returns commits from newest to last, shouldn't it be "down to"?

>**Note:** This step handles cases of objects that were uncommitted during previous GC run and are now deleted

##### Step 2. Analyze Data and Perform Cleanup for new entries (GC client)
1. Read all objects on branch path up to the previous run's last read partition (can be done in parallel by 'partition') -> `Branch DF`
Contributor:

Suggested change
1. Read all objects on branch path up to the previous run's last read partition (can be done in parallel by 'partition') -> `Branch DF`
1. Read all objects on branch path up from the previous run's last read partition (can be done in parallel by 'partition') -> `Branch DF`

Member Author:

It's actually down to the previous run's last read partition, since we are utilizing the adapter's list property to retrieve the entries sorted alphabetically

@itaiad200 itaiad200 requested a review from talSofer October 18, 2022 14:10
@ozkatz (Collaborator) commented Oct 18, 2022:

Great ideas @N-o-Z! This is a great direction to explore.
I'm wondering:

  1. Why are we scoping each "partition" with a branch? Isn't it simpler to simply have all partitions live under the storage namespace directly?
  2. The concept of partitions is really nice! Perhaps we can even start with a simpler implementation that isn't incremental? because partitions are more-or-less bounded in size, you've also solved prior design's limitation with object listing that couldn't be parallelized! do a prefix listing once, then have the executors divvy up these prefixes (with little to no skew since they are all pretty much the same size!)

still O(n) but at least it doesn't have to execute serially..

#### Flow 1: Clean Run

1. Run PrepareUncommittedForGC
2. Read all addresses from branch commits -> `lakeFS DF`
Contributor:

What's DF?

Member Author:

Data Frame. Since this design is eventually meant to be implemented as a Spark job, I found this terminology the most sensible.

Contributor:

How about using `committedDF`? It is more self-explanatory.

3. Each lakeFS instance will create a unique prefix partition (serialized) under the branch path to
store the branch objects.
The serialized partition prefix will allow partial scans of the bucket when running the optimized GC
4. lakeFS will track the count of objects uploaded to the prefix and create a new one every < TBD > objects uploaded
Contributor:

Create a new what? Prefix (i.e. unique prefix?)

Member Author:

Rephrased - let me know if it's clearer

Contributor:

It is. Thanks

@N-o-Z (Member Author) commented Oct 18, 2022:

Great ideas @N-o-Z! This is a great direction to explore. I'm wondering:

  1. Why are we scoping each "partition" with a branch? Isn't it simpler to simply have all partitions live under the storage namespace directly?
  2. The concept of partitions is really nice! Perhaps we can even start with a simpler implementation that isn't incremental? because partitions are more-or-less bounded in size, you've also solved prior design's limitation with object listing that couldn't be parallelized! do a prefix listing once, then have the executors divvy up these prefixes (with little to no skew since they are all pretty much the same size!)

still O(n) but at least it doesn't have to execute serially..

Thank you @ozkatz

  1. Scoping the "partitions" to branches allows us to scope our views of the committed and uncommitted objects to the branch level as well. In fact, it allows us to run the GC process at the branch level and enables concurrency. If we created the partitions at the repository scope (namespace), we would have to go over all committed and uncommitted data for the entire namespace in order to ascertain which objects can be deleted.
  2. We can definitely start with the clean flow and add the optimized flow later, though we need to keep several things in mind:
    1. Even when parallelized, testing S3 listing of a path with ~500,000 objects in a Databricks notebook running on the given configuration took around 25 seconds:
      [screenshot: listing benchmark]
      Taking, for example, a repo with around 1 billion objects (not so far-fetched, as I understand), even if it is perfectly partitioned into bulks of 500,000 objects the entire listing might take an unreasonable amount of time.
    2. Relying only on the clean flow, this becomes a continually growing problem: both the bucket size and the number of commits are unbounded, and eventually we will reach an unmanageable size.


The following describes the GC process run flows in the branch scope:

#### Flow 1: Clean Run
Contributor:

If I understand correctly, it's part of our current GC, right? Deleted-uncommitted and expired-committed will be deleted in the same run?

Member Author:

I believe it should be incorporated into the same ecosystem. Whether it is the same job, or a different one is to be considered. I'm not sure deciding this now is a requirement for this proposal

Contributor:

@N-o-Z let's make sure we do discuss this detail before settling on a final design. I'm fine with separating the algorithmic discussion from job architecture but there are trade offs we need to consider here.


#### PrepareUncommittedForGC

A new API which will create meta-ranges and ranges using the given branch uncommitted data. These files
Contributor:

So, this is kind of a "fake" commit with all uncommitted-undeleted objects, correct?
Pretty cool

Member Author:

Credit to @itaiad200 😄

repositories
* This solution partitions the branch data into segments that allow parallel listing of branch objects. In cases where
the branch path is extremely large, it might still take a lot of time to perform a clean run.
* For GC to work optimally, it must be executed on a timely basis. Long periods between runs might result in excessively long runs.
Contributor:

Can we improve it by limiting GC cycles (time limit or number of objects to handle)?

Member Author:

This will be a bit difficult IMO. Since object creation date is loosely coupled with commit date it might be very complicated to achieve

Contributor:

This is an ops issue. I might prefer occasional long runs, or common short runs.

Member Author:

Rephrased

@itaidavid (Contributor) left a comment:

Thank you @N-o-Z. If I understood correctly, you suggest treating all uncommitted (undeleted) objects as a retained commit, and deleting all objects that are not part of this or any other retained commit, correct?
What I mainly miss in this document is a description of the approach before the dive into detail, some story of the solution behind the details. I think it would help in understanding the solution and make it an easier read.

@talSofer talSofer requested a review from johnnyaug October 19, 2022 06:56
@arielshaqed (Contributor) left a comment:

Thanks! It looks really exciting, but I didn't entirely understand what "partitions" are.


1. GetPhysicalAddress to return a validation token along with the address.
2. The token will be valid for a specified amount of time and for a single use.
3. LinkPhysicalAddress to verify token valid before creating an entry.
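The token scheme in items 1-3 could be sketched as follows. This is an in-memory illustration with assumed names and TTL, not the proposed lakeFS implementation (a real server would persist the tokens so validation survives restarts):

```python
import secrets
import time

class AddressTokens:
    """Single-use, time-limited validation tokens for physical addresses."""

    def __init__(self, ttl_seconds=6 * 3600):   # TTL value is an assumption
        self.ttl = ttl_seconds
        self._issued = {}                        # token -> expiry timestamp

    def get_physical_address(self, prefix):
        """Issue a fresh physical address together with its token."""
        token = secrets.token_hex(16)
        self._issued[token] = time.time() + self.ttl
        return f"{prefix}/{token}", token

    def link_physical_address(self, token):
        """Verify the token before creating an entry; pop() makes it
        single-use, so a second link attempt with the same token fails."""
        expiry = self._issued.pop(token, None)
        if expiry is None or time.time() > expiry:
            raise ValueError("token expired or already used")
```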
Contributor:

I don't understand why the race here (between token valid and entry creation) is not a problem: what if the instance verifies a valid token, then the token becomes invalid, and then the entry is created?

In particular, "time interval" is often a race -- particularly in difficult conditions when cluster members have poor time sync.

Contributor:

Yes, time sync introduces a challenge here. What if GC thinks a token has expired while the server thinks it's valid?


A new API which will create meta-ranges and ranges using the given branch uncommitted data. These files
will be saved in a designated path used by the GC client to list branch's uncommitted objects.
For the purpose of this document we'll call this the `GC commit` (feel free to suggest a better term :) )
Contributor:

It's not a commit, it's "just" a metarange.

Comment on lines 65 to 72
A new API which will create meta-ranges and ranges using the given branch uncommitted data. These files
will be saved in a designated path used by the GC client to list branch's uncommitted objects.
For the purpose of this document we'll call this the `GC commit` (feel free to suggest a better term :) )

#### Reading data from lakeFS

Reading data from lakeFS will be similar to the current GC process - using the SSTable reader to quickly list branch objects.
The above-mentioned API will enable doing so for all branch objects (committed and uncommitted).
Contributor:

What are the advantages of doing this over directly providing an API to list the branch objects? If the goal is "merely" to get a static listing, could we instead provide an API to push a new staging token and return the previous list of staging tokens? Then the GC Spark job could use another (new) API call list all objects on that list of staging tokens.

Otherwise we end up with a strange unused metarange. 🤷🏽

Member Author:

This metarange is used not only for the current GC run, but used later in the next run as a reference to the current state of the branch and enables an important optimization. Afterwards it can be safely removed.
Also I don't think the sealed tokens provide us with a static state - changes can still occur on the branch which can modify the sealed tokens state.

#### Flow 1: Clean Run

1. Run PrepareUncommittedForGC
2. Read all addresses from branch commits -> `lakeFS DF`
Contributor:

I don't think that that works: these are all addresses to keep, they might occur only on another branch (e.g. if copied from another branch by staging them to this one).


## Limitations

* Since this solution relies on the new repository structure, it is not backwards compatible. Therefore, another solution will be required for existing
Contributor:

Can we have a slower (one-time) run to cleanup old repositories, once?

Member Author:

I think we won't have any option but to do that :). I'm just not sure this should be a part of this proposal. WDYT?

Contributor:

In a separate proposal is great, just let's have it :-)

Member Author:

Added a discovery task:
#4382

##### Step 1. Analyze Data and Perform Cleanup for old entries (GC client)

1. Run PrepareUncommittedForGC
2. Read addresses from branch's new commits (all commits up to the last GC run commit id) -> `lakeFS DF`
Contributor:

Where do we store this "run ID"?

Member Author:

Added:

  1. Finally, save the current run's GC commit, the last read partition, and the newest commit id in a designated location on the branch path

3. Each lakeFS instance will create a unique prefix partition (serialized) under the branch path to
store the branch objects.
The serialized partition prefix will allow partial scans of the bucket when running the optimized GC
4. lakeFS will track the count of objects uploaded to the prefix and create a new one every < TBD > objects uploaded
Contributor:

It still requires serializing all uploads.


1. Repository objects will be stored under the prefix `<storage_namespace>/repos/<repo_uid>/`
2. Branch objects will be stored under the repo prefix with the path `branches/<branch_id>/`
3. Each lakeFS instance will create a unique prefix partition (serialized) under the branch path to
Contributor:

Not sure what these partitions are: Staging tokens? Are different instances allowed to scan and use other instances' partitions?

Can you add some motivation for this? It raises some questions.

  • How "ascending" does this need to be? Would a timestamp, which is not really ascending but generally satisfies this, be good enough?
  • How "unique" does this need to be?

#### StageObject

1. Allowed only for address outside the repo namespace
2. Prevent race between staging object and GC job
Contributor:

What kind of race? Is this the token thing?

@N-o-Z (Member Author), Oct 19, 2022:

By not allowing StageObject on addresses inside the repo namespace, we prevent a race where an un-referenced address is being staged using StageObject while it is being deleted by GC
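The constraint described above amounts to a simple prefix check at staging time. A hypothetical sketch (the function name and error message are assumptions):

```python
def validate_stage_address(repo_namespace, physical_address):
    """Reject StageObject calls whose physical address lies inside the
    repository's storage namespace.  GC only scans addresses under the
    namespace, so an externally staged object there could be staged at the
    same moment GC is deleting it - the race described above."""
    ns = repo_namespace.rstrip("/") + "/"
    if physical_address.startswith(ns):
        raise ValueError(
            "StageObject is only allowed for addresses outside the repository namespace"
        )
```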


1. GetPhysicalAddress to return a validation token along with the address.
2. The token will be valid for a specified amount of time and for a single use.
3. LinkPhysicalAddress to verify token valid before creating an entry.
Contributor:

Yes, time sync introduces a challenge here. What if GC thinks a token has expired while the server thinks it's valid?


#### Objects Path Conventions

1. Repository objects will be stored under the prefix `<storage_namespace>/repos/<repo_uid>/`
Contributor:

Note that today, the storage namespace is tied to a repository - and not the installation.
So partitioning according to repository is irrelevant.

Member Author:

You are right - but we need to take into consideration scenarios where a repository is deleted and created again with the same name. Although it has the same name, the unique identifier is different and we need to take that into consideration.
Perhaps you have a better suggestion on how to do this?

@johnnyaug (Contributor), Oct 19, 2022:

Are you referring to cleaning up the data of deleted repositories? If so, I think we can relax this requirement.

I don't necessarily object to a "global storage namespace": I think it's a great idea. It's just that we may want to avoid it because it's too big at this point. If we choose to go this way, it should be mentioned explicitly in this doc (and have a design of its own when the time comes).

Contributor:

Even today, you can't* create a repository in a storage namespace that serves as a storage namespace for some other repo. So it doesn't matter if the repo is deleted, created by an installation that is no longer active, etc, you can't recreate the repo in the same storage namespace unless it was cleaned.

  • Repo creation checks for the dummy object in the root of the namespace. You can argue that it's not safe enough, but I think we can rely on it for the time being.

Contributor:

I didn't understand what we are trying to mitigate here.

#### Objects Path Conventions

1. Repository objects will be stored under the prefix `<storage_namespace>/repos/<repo_uid>/`
2. For each repository branch, objects will be stored under the repo prefix with the path `branches/<branch_id>/`
Contributor:

I think partitioning according to branch is a brilliant idea, but I'm wondering whether it's premature optimization. It does introduce some complexity and I'm not sure we have evidence that it will benefit us in the real world.

@N-o-Z (Member Author), Oct 19, 2022:

This allows us to scope the GC process only to the commits and staging area that are related to the branch. Otherwise, we will have to read all the information from all commits and all staging areas to understand what we need to delete.

Contributor:

I don't understand if this is required for correctness or just an optimization.

1. Run PrepareUncommittedForGC
2. Read all addresses from branch commits -> `lakeFS DF`
3. Read all addresses from branch `GC commit` -> `lakeFS DF`
4. Read all objects on branch path directly from object store (can be done in parallel by 'partition') -> `Branch DF`
Contributor:

Is this a listing on the object store? In what sense is it parallel?

Member Author:

Since the objects on a given branch are divided into partition prefixes, we can use workers to list objects by partition and aggregate the results.
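The per-partition fan-out described here can be sketched with a thread pool; `list_prefix` is a hypothetical stand-in for the store adapter's list call (e.g. a paginated S3 ListObjectsV2 under that prefix):

```python
from concurrent.futures import ThreadPoolExecutor

def list_branch_objects(partitions, list_prefix, workers=8):
    """Fan out one object-store listing per partition prefix and merge the
    results.  Because partitions are roughly bounded in size, the work
    divides across workers with little skew."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        listings = pool.map(list_prefix, partitions)
    return [obj for listing in listings for obj in listing]
```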

lakeFS instance to store the branch's objects. This prefix will be composed of two parts:
1. Lexicographically sortable, descending time based serialization
2. A unique identifier for the lakeFS instance
`<sortable_descending_serialized_uid>_<lakefs_instance_uid>`
Contributor:

Why is the instance ID a part of the partition?

Member Author:

Each instance writes only to its own partition under the branch prefix. This allows the lakeFS instance to track the amount of files uploaded to this partition and decide when to create a new partition

@nopcoder (Contributor) left a comment:

Comments in the body. I think this is the way; we just need to extend the description to cover how we handle deletion of uncommitted data above the branch level as well (branch delete and repository delete).


## Design

### Required changes by lakeFS
Contributor:

Suggest explaining the idea before the required changes to implement it. Or, at least per section, explain the reasoning behind the suggested change so the implementer can understand the goal and not just the how.

Contributor:

+1, I'm missing context to understand the suggested changes. It would add clarity to explain the overall flow at a high level, possibly with a diagram, before diving into the separation between lakeFS server changes and Spark client changes.
Also, it would be useful to describe the idea behind the solution, e.g. the inputs to the algorithm are (1) committed objects, (2) uncommitted objects, and (3) objects ever written to lakeFS; the algorithm aims to find objects that are in (3) and not in (1) or (2).


#### StageObject

1. Allowed only for address outside the repo namespace
Contributor:

Need to explain the constraint and/or if we are going to remove/change any current functionality or just enforce existing one.

#### StageObject

1. Allowed only for address outside the repo namespace
2. Prevent race between staging object and GC job
Contributor:

Is this item a requirement from lakeFS, or just explaining the result of the previous requirement?

2. Objects that were uploaded to a physical address issued by the API and were not linked before the token expired will
eventually be deleted by the GC job.

#### CopyObject
Contributor:

Can you explain which CopyObject you are addressing - is it the option to do so using the S3 gateway, or a new OpenAPI endpoint we need to add to address the fact that we can't stage objects from staging?
If this is S3-only, we need to explain how we do the same using the OpenAPI.


## Motivation

Uncommitted data which is no longer referenced (due to branch deletion, reset branch etc.) is not being deleted by lakeFS.
Contributor:

Suggested change
Uncommitted data which is no longer referenced (due to branch deletion, reset branch etc.) is not being deleted by lakeFS.
Uncommitted data which is no longer referenced (due to any staged data deletion or override, reset branch etc.) is not being deleted by lakeFS.

Comment on lines 49 to 69
1. GetPhysicalAddress to return a validation token along with the address.
2. The token will be valid for a specified amount of time and for a single use.
Contributor:

Consider moving item two to be first, as it will require active tracking using storage; once we have storage, we can also enforce any validation when using the link, without embedding information in the path itself.

#### PrepareUncommittedForGC

A new API which will create meta-ranges and ranges using the given branch uncommitted data. These files
will be saved in a designated path used by the GC client to list branch's uncommitted objects.
Contributor

  • Prefer to create a new folder for each request in order to keep history, support parallel requests, and prevent data deletion or the accumulation of old files. And/or we should keep state for the current request: progress, completion and errors.
  • A risk here is that a single instance is responsible for extracting all the staging information. The request can take a long time, and since it is handled by a single instance that can go down, we would have to start from scratch.
  • From the work Yoni did, we should consider writing this data in Parquet format.

Contributor

Can we instead make this a paging call by passing the client some continuation token? I know that it will be slower for the client to read, but reducing load on the server may be more important. @lynnro314 has already done some research for how to speed up prepareGCCommits, I am not sure we want to open a new front with an API that may require similar work almost immediately.


#### PrepareUncommittedForGC

A new API which will create meta-ranges and ranges using the given branch uncommitted data. These files
Contributor

Just double-checking because the design talked about branch level - this is one set of files per repository?

4. The remainder is a list of files which can be safely removed
5. Finally, save the current run's `GC commit`, the last read partition, and the newest commit id in a designated location on the branch path

## Limitations
Contributor

I think that the solution should include - how we delete uncommitted data at the 3 levels:

  1. delete repository
  2. delete branch
  3. delete object

#### Objects Path Conventions

1. Repository objects will be stored under the prefix `<storage_namespace>/repos/<repo_uid>/`
2. For each repository branch, objects will be stored under the repo prefix with the path `branches/<branch_id>/`
Contributor

We should address create/delete/create of the same branch by using a unique id at the branch level as well.

@N-o-Z
Member Author

N-o-Z commented Oct 19, 2022

Thank you everyone for the most invaluable input! 🙏🏽
We've decided to try and modify this proposal to remove the use of branch prefixes and use a time-based partition only.
Please hold off on any new comments until I'm able to address the current ones and modify the proposal.
I'll update once I've made all the required changes - Thanks!

Contributor

@talSofer talSofer left a comment

@N-o-Z thanks for this great proposal!
I'm posting my comments since they are unrelated to the changes you are planning to the document.


## Design

### Required changes by lakeFS
Contributor

+1, I'm missing context to understand the suggested changes. It will add clarity if we explain the overall flow at a high level, possibly with a diagram, before diving into the separation between lakeFS server changes and Spark client changes.
Also, it will be useful to describe the idea behind the solution - e.g. the inputs to the algorithm are (1) committed objects, (2) uncommitted objects, and (3) all objects ever written to lakeFS; the algorithm aims to find objects that are in 3 but not in 1 or 2.

#### Flow 1: Clean Run

1. Run PrepareUncommittedForGC
2. Read all addresses from branch commits -> `lakeFS DF`
Contributor

How about using committedDF? It is more self-explanatory.


The following describe the GC process run flows in the branch scope:

#### Flow 1: Clean Run
Contributor

@N-o-Z let's make sure we do discuss this detail before settling on a final design. I'm fine with separating the algorithmic discussion from job architecture but there are trade offs we need to consider here.


The following describe the GC process run flows in the branch scope:

#### Flow 1: Clean Run
Contributor

here to add clarity I suggest adding the intent behind each step, i.e.

  1. mark uncommitted data
  2. get all committed addresses
  3. get all uncommitted addresses
  4. get all objects ever written to lakeFS on that path...
  5. subtract committed from all objects on the branch
    etc...

##### Step 1. Analyze Data and Perform Cleanup for old entries (GC client)

1. Run PrepareUncommittedForGC
2. Read addresses from branch's new commits (all new commits down to the last GC run commit id) -> `lakeFS DF`
Contributor

To do that, you may want to use the run ID concept the existing GC has. See https://github.com/treeverse/cloud-controlplane/blob/main/design/accepted/gc-with-run-id.md in this context. This is our plan for implementing incremental GC, and I suggest that we consider using the same concept.

N-o-Z and others added 3 commits October 23, 2022 13:09
@N-o-Z N-o-Z added team/versioning-engine Team versioning engine GC+ labels Oct 25, 2022
since the last GC run.

## Performance Requirements
Contributor

I'm missing the actual requirement below. It is sort of an estimation of a single task of the uncommitted GC. How long will the entire GC run take? What's the repo status (number of commits, branches, objects to delete, etc.)?

Member Author

Done

The GC process is composed of 3 main parts:
1. Listing namespace objects
2. Listing of lakeFS repository committed objects
3. Listing of lakeFS repository uncommitted objects
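The relationship between these three listings can be expressed as a simple set difference (a sketch using plain Python sets as stand-ins for the actual listings/DataFrames):

```python
def find_deletable(namespace_objects, committed, uncommitted):
    """Candidates for deletion are objects present in the storage
    namespace (part 1) but referenced by neither committed (part 2)
    nor uncommitted (part 3) lakeFS metadata."""
    return namespace_objects - committed - uncommitted

# Example: "c" exists in the bucket but no metadata references it.
print(sorted(find_deletable({"a", "b", "c"}, {"a"}, {"b"})))  # ['c']
```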
Contributor

Does an object in case 3 mean that there's an active branch that holds the uncommitted object, or is it any uncommitted object under that repository ("dangling" or not)?

Member Author

An active branch that holds the uncommitted object.
We read the committed + uncommitted data from lakeFS via metaranges, so it's essentially a static view of the repository (from lakeFS's point of view) at a specific point in time

Performing tests against an AWS S3 bucket using a Databricks notebook on a m4.large cluster, we've observed listing of
~500,000 objects takes approximately 25 seconds.
We can estimate that on a repository with ~1B objects, using ~500K object size slices, and using 10 workers - listing will take
around 1.5 hours.
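As a sanity check, the quoted estimate works out as follows (assuming the measured 25 seconds per 500K-object slice and perfectly parallel workers):

```python
objects_total = 1_000_000_000
slice_size = 500_000
seconds_per_slice = 25
workers = 10

slices = objects_total // slice_size                 # 2,000 slices
wall_clock_s = slices * seconds_per_slice / workers  # ideal parallelism
print(slices, round(wall_clock_s / 3600, 1))         # 2000 slices, ~1.4 hours
```

That is consistent with the quoted "around 1.5 hours" once some scheduling overhead is added.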
Contributor

How do we handle failures in such a long run - say, a crash after an hour of execution?
Does the entire process need to be rerun?
Maybe we can parallelize the GC itself, instead of just the listing?

Member Author

The GC job will be parallelized, but we can't rely on failed runs. To get a correct picture of the namespace, the process must run successfully and without errors.
Relying on partial data from failed jobs would increase the risk of deleting committed data - which is something that absolutely must not happen

Contributor

@itaidavid itaidavid left a comment

Thanks @N-o-Z
Added a question regarding the impact of an error in such a long process, but other than that LGTM.


For each GC run, save the following information using the GC run id as detailed in this [proposal](https://github.com/treeverse/cloud-controlplane/blob/main/design/accepted/gc-with-run-id.md):
1. Save metaranges in `_lakefs/gc/run_id/metadata/`
2. Save `Uncommitted DF` in `_lakefs/gc/run_id/uncommitted.parquet`
Contributor

Correct me if I'm wrong @treeverse/ecosystem, but Parquet outputs are multiple files (one per partition) and they're normally saved under a prefix of their own for better separation, right?
So a better path would be _lakefs/gc/run_id/uncommitted/entries.parquet

@N-o-Z
Member Author

N-o-Z commented Oct 26, 2022

Moved proposal to "accepted" folder.
Please take the time to give final notes before I merge this PR

#### [Get/Link]PhysicalAddress

1. GetPhysicalAddress to return a validation token along with the address (or embedded as part of the address).
2. The token will be valid for a specified amount of time and for a single use.
Contributor

Why does it matter that the token is valid for a single use?

Member Author

See: #4438

Contributor

This adds an operational requirement for time synchronization between lakeFS instances that belong to the same cluster. I am afraid of adding these because they can be hard to achieve.

As just one example, consider a physical machine in an on-prem cluster that has just booted after a long while down: if its clock is ahead of real time, NTP will take a while to slip it back to the correct value, and in the meantime lakeFS will be up and happily destroying our assumptions :-(

Contributor

Also on AWS awslabs/amazon-eks-ami#249 is a recurring issue in AMIs, and a linked issue has someone with a 7 minute time skew on EKS.

Clocks should be synced, but I think we will exact too strong a penalty for failure here. We should add:

  1. Some safety mechanism.

    For instance:

    • If when we scan we discover prefixes from the distant future (>1 minute?) then error out and mark some metric for alerting.
    • Create an additional marker object in every partition once an hour, and scan what the other cluster members put in theirs. This gives an estimate of time skew, which can be monitored.
  2. Operational guidance (strongly worded documentation).
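The marker-based skew estimate from the second bullet could look something like the following (a rough sketch; it ignores marker age within the hour, and all names are illustrative, not lakeFS APIs):

```python
def estimate_skews(marker_times, local_now):
    """marker_times: {instance_id: clock value the instance wrote in its
    latest marker object}. Returns per-instance skew (seconds) relative
    to the local clock; positive means the peer's clock runs ahead."""
    return {inst: ts - local_now for inst, ts in marker_times.items()}

def fail_shut(skews, threshold_s=60.0):
    """Refuse to GC (and raise an alert) if any peer appears to be more
    than threshold_s ahead - i.e. we would see prefixes 'from the
    future' while scanning."""
    return any(skew > threshold_s for skew in skews.values())

skews = estimate_skews({"lakefs-1": 1000.0, "lakefs-2": 1420.0}, 1000.0)
print(fail_shut(skews))  # True: lakefs-2 looks 7 minutes ahead
```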

Contributor

@itaiad200 itaiad200 left a comment

Found a bug :(

Comment on lines 90 to 94
#### PrepareUncommittedForGC

A new API which will create meta-ranges and ranges for a given branch using its uncommitted data. These files
will be saved in a designated path used by the GC client to list branch's uncommitted objects.
For the purpose of this document we'll call this the `BranchUncommittedMetarange`
Contributor

I think there's a bug.
I start preparing the uncommitted objects. During that time a Copy is happening on the same branch, making it a metadata operation. The copy copies from the path zz to the path aa. The preparation of the uncommitted objects started before aa was created, but by the time it reached zz it was already deleted. Now I see neither of them in the output file, and the physical object may be deleted by the GC.

Member Author

Thanks! Addressed in proposal

Contributor

@itaiad200 itaiad200 left a comment

Thanks for addressing the comments, I added a few more but wish not to further block.

1. Copy object in the same branch will work the same - creating a new staging entry using the existing entry information.
2. For objects that are not part of the branch, use the underlying adapter copy operation.
When performing a shallow copy - track copied objects in ref-store.
GC will read the copied objects information from the ref-store, and will add them to the list of uncommitted.
Contributor

The timing here is critical - it MUST read the new table after it has iterated over the staging area of all branches.

Member Author

Explained in PrepareUncommittedForGC - added additional note to emphasize


#### Move/RenameObject
Clients working through the S3 Gateway can use the CopyObject + DeleteObject to perform a Rename or Move operation.
For clients using the OpenAPI this could have been done using StageObject + DeleteObject.
To continue support of this operation, introduce a new API to rename an object which will be scoped to a single branch.
Rename will add copied objects in ref-store similarly to CopyObject
Contributor

So it needs to maintain that new table too, right?

Member Author

Added

#### Move/RenameObject
Clients working through the S3 Gateway can use the CopyObject + DeleteObject to perform a Rename or Move operation.
For clients using the OpenAPI this could have been done using StageObject + DeleteObject.
To continue support of this operation, introduce a new API to rename an object which will be scoped to a single branch.
Contributor

@N-o-Z Still relevant - If we have a way to logically do a rename with copy and delete, why do we need a non-atomic rename in the API?

Contributor

@arielshaqed arielshaqed left a comment

Thanks!

Really trying to avoid the namespace reorg, it sounds expensive :-(


The heaviest operation during the GC process is the namespace listing. And while we added the above optimizations to mitigate
this process, the fact remains - we still need to scan the entire namespace (in the Clean Run mode).
Performing tests against an AWS S3 bucket using a Databricks notebook on a m4.large cluster, we've observed listing of
Contributor

Can we use a larger instance, to see if we are blocked on network or on something else?

Member Author

Checked with an xlarge cluster and max workers 10.
Did some more tests and created a UDF which reads a specific partition and returns the objects' information.
For 1,000,000 objects divided equally into 100 partitions I was able to load the information into a DF within 30 seconds.
The results look pretty promising - and I'm sure we can also improve on that.
I will update the proposal with final results this week


### 3. Listing of lakeFS repository uncommitted objects

Expose a new API in lakeFS which writes repository uncommitted objects information into a parquet file in a dedicated path
Contributor

Suggest not fixing a format for now: it is far from clear which format is best to use, but luckily this looks like an implementation detail that is easy to change.

Member Author

I agree this is an implementation detail and will remove the format specific details from the proposal


lakeFS will track copy operations of uncommitted objects and store them in the ref-store for a limited duration.
GC will use this information as part of the uncommitted data to avoid a race between the GC job and rename operation.
lakeFS will periodically scan these entries and remove copy entries from the ref-store after such time that will
allow correct execution of the GC process.
Contributor

Not sure that I understand. How does lakeFS know that GC actually ran?

#### Track copied objects in ref-store

lakeFS will track copy operations of uncommitted objects and store them in the ref-store for a limited duration.
GC will use this information as part of the uncommitted data to avoid a race between the GC job and rename operation.
Contributor

Copy operation? Not sure lakeFS can support atomic rename.

Member Author

@N-o-Z N-o-Z Nov 8, 2022

We are tracking copy operations but the purpose is to avoid a race in rename.
Consider the following:

  1. GC starts reading branch staging area
  2. The rename operation creates a copy of the staged entry - which does not get listed due to listing order
  3. The rename operation deletes the original entry before GC is able to scan it as part of the listing of the staging area

As a result, the physical address will not be listed in uncommitted (or committed) and will be a candidate for deletion.
Reading from the copy table allows exempting copied objects from the delete list and avoids this race.

The copied entries' lifespan can be minutes to an hour, or they can be removed as part of a branch operation (commit/reset/delete branch)

In the next GC iteration - these addresses will either be in committed, still in staging or deleted and will be handled accordingly
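The fix can be restated with sets (a sketch; the copy table's entries are simply treated as additional uncommitted references):

```python
def gc_candidates(listed_staging, committed, copy_table, namespace):
    """Without copy_table, an address renamed mid-listing (copied to a
    new key the listing already passed, then deleted from the old key
    before the listing reached it) would appear unreferenced and be
    wrongly deleted. Exempting copy-table entries closes the race."""
    referenced = listed_staging | committed | copy_table
    return namespace - referenced

# "X" was renamed during the listing and missed entirely, but it is in
# the copy table, so only the truly dangling "Y" remains a candidate.
print(gc_candidates(set(), set(), {"X"}, {"X", "Y"}))  # {'Y'}
```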


When performing a shallow copy - track copied objects in ref-store.
GC will read the copied objects information from the ref-store, and will add them to the list of uncommitted.
lakeFS will periodically clear the copied list according to timestamp.
Contributor

How long is this? The shorter it is, the greater the requirement for time sync between lakeFS instances. So it seems like it needs to be fairly large.

Contributor

@arielshaqed arielshaqed left a comment

Neat!

I'm only worried about the operational aspect of maintaining synchronized times across the cluster of lakeFS servers. But I believe that we can observe the time skew automatically, even as part of time-based partitioning!

Given that we're talking about deleting files, we should be very sure that we help operators control time sync. Ideally fail shut and refuse to GC uncommitted data, or even to work at all, when time sync is not good enough. (I'm talking about minutes-level time sync here, of course. NTP gives you milliseconds-level time sync. But virtualization is notorious for occasionally screwing up time sync.)


@N-o-Z N-o-Z merged commit 0e15957 into master Nov 10, 2022
@N-o-Z N-o-Z deleted the proposal/offline-uncommitted-gc-2 branch November 10, 2022 16:49
Labels
exclude-changelog PR description should not be included in next release changelog GC+ proposal team/versioning-engine Team versioning engine

Successfully merging this pull request may close these issues.

Hard-delete objects that were never committed
9 participants