Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC+/ Milestone 2 execution plan #4801

Merged
merged 5 commits into from
Dec 18, 2022
Merged

Conversation

eden-ohana
Copy link
Contributor

@eden-ohana eden-ohana commented Dec 12, 2022

closes #4802

@eden-ohana eden-ohana added exclude-changelog PR description should not be included in next release changelog GC+ labels Dec 12, 2022
@eden-ohana eden-ohana requested a review from a team December 12, 2022 22:09
Comment on lines 5 to 6
## Milestone 1
The first beta version that was released included:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Milestone 1
The first beta version that was released included:
## Milestone 1
The first beta version that was released included:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

1. Implementation of the clean run flow for old and new repository structures (without optimizations)
2. Mark & Sweep
3. Integration tests
4. Backup & Restore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimal support using rclone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

## Milestone 2

### Goals
1. Removing the limitation of a read-only lakeFS during the job run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job run == gc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Comment on lines 16 to 17
2. Performance improvements - better parallelization of the storage namespace traversal
3. Implementing Optimized Run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 and 3 are performance and optimizations - the first item is self explained, I didn't understand the second one.
Do you mean incremental uncommitted gc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nopcoder Optimized run means incremental
@eden-ohana I think we should focus this milestone on optimization of the full run and adding more testing (unit / integration) - and create a separate milestone for the incremental/optimized run

* [lakeFSFS renameObject method](https://github.com/treeverse/lakeFS/issues/4479)
* [Track copied objects in ref-store](https://github.com/treeverse/lakeFS/issues/4562)

3. Performance improvements:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all uncommitted GC? maybe we can specify it as there are lakefs, common code and etc.

Suggested change
3. Performance improvements:
2. Performance improvements:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

3. Performance improvements:
* [Optimaized listing on old repository structure](https://github.com/treeverse/lakeFS/issues/4620)
* [Efficient listing on committed entries](https://github.com/treeverse/lakeFS/issues/4600)
* [Optimized listing on slices](https://github.com/treeverse/lakeFS/issues/4614)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought we already doing this one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nopcoder Yes you did :) I think we can close this issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

3. Implementing Optimized Run

### Non-Goals
1. Support for non-S3 repositories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big bullet - I think it should be itemized by storage providers. Also I really think we should do it incremental.
For example - for this milestone, starting with Azure - and afterwards working on GCS

Copy link
Member

@N-o-Z N-o-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eden-ohana Thanks for creating this plan.
It looks well put through - I think we learned a lot from the first milestone, especially about our expectations and we should put this into this plan.
There is a lot of work hidden inside the milestone goals lines in my opinion.
I believe we should focus on improving our current solution for the 2nd milestone - structuring it, creating more tests and add optimizations.
We should also start to create the ground for incremental runs but I think the client implementation itself should be planned for milestone 3

@eden-ohana
Copy link
Contributor Author

@ozkatz can you give context from a product perspective? Is there any requirement to prioritize the GC+ for Azure?

@ozkatz
Copy link
Collaborator

ozkatz commented Dec 13, 2022

@eden-ohana I was under the impression that since the "Sweep" part of GC+ is re-used (i.e. it's the same "sweep" that is already implemented for Azure) we should support this ootb?

There are active, production installations that we know of that are running on Azure, with compliance requirements that are very hard to meet without GC+..

@eden-ohana
Copy link
Contributor Author

@eden-ohana I was under the impression that since the "Sweep" part of GC+ is re-used (i.e. it's the same "sweep" that is already implemented for Azure) we should support this ootb?

There are active, production installations that we know of that are running on Azure, with compliance requirements that are very hard to meet without GC+..

The sweep part is reused but we added more functionalities that weren't tesed on Azure

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It feels like a very-stressed milestone. I think we should settle for the "safe-write" lakeFS goal and leave the rest for the next milestones. The timeline itself is too short for all three, I think...


* marks dependency

1. Required changes by lakeFS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're missing a non-trivial test issue here. I'm not sure how to test that lakeFS is safe while GC is running. Feels like we should plan for that here too.

* [Optimaized listing on old repository structure](https://github.com/treeverse/lakeFS/issues/4620)
* [Efficient listing on committed entries](https://github.com/treeverse/lakeFS/issues/4600)
* Benchmarks - verify uncommitted GC [performance requirements](https://github.com/treeverse/lakeFS/blob/e316cafe7717bb3203e4018837a41415aa61f74b/design/accepted/gc_plus/uncommitted-gc.md?plain=1#L185) are kept
2. [Implement optimized run flow](https://github.com/treeverse/lakeFS/issues/4489)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we implementing the incremental run? I thought we decided to do it in the next milestone

3. Support for non-S3 repositories
* Azure
* GCP
4. Incorporation of committed & uncommitted GC into a single job
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - I thought we were going to push it towards the end of development


By the end of this milestone, we will release a new beta version that includes all the additions.

**Due date: 15/01/2023**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this due date as a bit ambitious. We still left a lot of work inside this milestone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't anymore :)

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@N-o-Z N-o-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🙏

@eden-ohana eden-ohana merged commit fa39e88 into master Dec 18, 2022
@eden-ohana eden-ohana deleted the gc+/mileston2-execution-plan branch December 18, 2022 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-changelog PR description should not be included in next release changelog GC+
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GC+/Milestone 2 execution plan
5 participants