Is LFS store garbage collected? #7045

Closed
2 of 7 tasks
yacoob opened this issue May 25, 2019 · 4 comments · Fixed by #22385
Labels
type/feature Completely new functionality. Can only be merged if feature freeze is not active. type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@yacoob

yacoob commented May 25, 2019

  • Gitea version (or commit ref): 1.8.1
  • Git version: 2.21.0
  • Operating system: Linux
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

I'm trying to understand how Gitea's LFS store gets garbage collected. I can see some references to LFS object removal in the code, but I can't find a definite answer as to when exactly unreferenced blobs are removed from the LFS directory. As a test, I created a repository on Gitea, pushed some LFS objects to it, then removed the branch referencing them and forced a gc in the admin panel. The objects under git/lfs are still present.

Does this kind of gc happen at all? Or does it happen only after the whole repository is removed? If there's no automatic gc, please treat this bug as a feature request. If there is, consider this a documentation request.

Thanks!

@yasuokav
Contributor

You have to delete the repository to remove the LFS objects from disk.

@lunny lunny added the type/question Issue needs no code to be fixed, only a description on how to fix it yourself. label May 26, 2019
@yacoob
Author

yacoob commented Jun 1, 2019

Is that the only GC that happens for LFS? Okay - can we treat this issue as a request to implement something more granular that would run during git gc?

Thanks!

@zeripath
Contributor

zeripath commented Jun 1, 2019

There is one big store - files are not stored by repository but by oid. The repository information is kept separately. We don't have the putative filename (LFS never gives it to us), and we never get the SHA of the pointer file that points to the oid. Although you can guess what the plain, extension-free pointer should be, the spec allows for extensions, so you won't be able to guess them all. You can't even simply use a .gitattributes file stored within the repository, as it might not be stored, and they might not call the filter lfs!
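To make the "one big store" point concrete, here is a minimal sketch in Python. It is a toy model, not Gitea's actual schema or code: the class `LfsStore`, its `objects` and `links` fields, and the `put`/`unlink` methods are all hypothetical names chosen for illustration. It only shows why content keyed by oid can be deleted safely once no repository is linked to it.

```python
import hashlib


class LfsStore:
    """Toy model (hypothetical): one content store keyed by oid,
    plus a separate table of (repo_id, oid) associations."""

    def __init__(self):
        self.objects = {}   # oid -> bytes, shared by every repository
        self.links = set()  # (repo_id, oid) pairs: the repository information

    def put(self, repo_id, content):
        oid = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(oid, content)  # stored once, however many repos use it
        self.links.add((repo_id, oid))
        return oid

    def unlink(self, repo_id, oid):
        """Drop one repo's association; delete content only if no repo links it."""
        self.links.discard((repo_id, oid))
        if not any(linked == oid for _, linked in self.links):
            self.objects.pop(oid, None)


store = LfsStore()
oid = store.put(repo_id=1, content=b"big binary")
store.put(repo_id=2, content=b"big binary")  # same oid, second association
store.unlink(1, oid)
still_there = oid in store.objects  # repo 2 still links the object
store.unlink(2, oid)
gone = oid not in store.objects     # last link removed, content dropped
```

Because two repositories can share one oid, removing a branch or even a whole repository does not by itself tell you the content is unused.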

Therefore in terms of a GC, what you would have to do is:

  • Get a list of the oids that are stored in the LFS for the repo. (Simple select on the database)

  • Walk the git repository, find all blobs <=1k, check if they look like a pointer file, if so get the oid, check if it's stored in the LFS and is associated with the repo.

  • Give a diff of the two (possibly three) states.

  • Should any LFS objects that are unreachable from the repository be suggested for deletion? I guess so, but you don't know why they're there - you're assuming LFS is only being used by git-lfs. This might be useful to know about, and then you could prune these, but this can't be done automatically.

  • What about potential oids that are missing - either because they're not attached to the repo or they're not in the LFS? Well are you sure that they're actually pointers rather than just files that look like pointers? (You cannot tell the difference - you can't assume .gitattributes is present and you can't really assume that they're only placed there by filter.lfs.* commands either.)

  • Do you reveal that you have a file matching an oid but one that is not attached to the repo? It could be a security issue to do so - although if sha256 is a secure hash the only way you should have the hash is if you have the object.
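The first three steps above can be sketched roughly as follows. This is an illustration only, not Gitea code: `pointer_oid` and `diff_states` are hypothetical helpers, the blobs are passed in as plain byte strings rather than read from a git repository, and only the version line and `oid sha256:` line of the Git LFS pointer format are checked. The third state (whether the content is actually present in the store) is left out to keep the sketch short.

```python
import re

POINTER_MAX_SIZE = 1024  # only small blobs are worth inspecting
VERSION_LINE = "version https://git-lfs.github.com/spec/v1"
OID_RE = re.compile(r"^oid sha256:([0-9a-f]{64})$", re.MULTILINE)


def pointer_oid(blob: bytes):
    """Return the oid if the blob looks like an LFS pointer, else None."""
    if len(blob) > POINTER_MAX_SIZE:
        return None
    try:
        text = blob.decode("utf-8")
    except UnicodeDecodeError:
        return None
    if not text.startswith(VERSION_LINE + "\n"):
        return None
    match = OID_RE.search(text)
    return match.group(1) if match else None


def diff_states(linked_oids, repo_blobs):
    """Compare oids linked to the repo in the database with oids found in git blobs."""
    referenced = {oid for oid in map(pointer_oid, repo_blobs) if oid}
    linked = set(linked_oids)
    return {
        "unreferenced": linked - referenced,  # candidates to report, not to auto-delete
        "missing": referenced - linked,       # pointer-like blobs with no linked object
    }


a = "a" * 64
b = "b" * 64
pointer = f"{VERSION_LINE}\noid sha256:{a}\nsize 12\n".encode()
result = diff_states({a, b}, [pointer, b"plain text file\n"])
```

Note that a blob passing `pointer_oid` only looks like a pointer; as the remaining bullets say, nothing here proves it was produced by git-lfs, which is why the output is a report rather than a deletion list.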

In #7082 I decided that the only sensible thing to do when merging a PR from one repository to another was to check if a blob could be a pointer file, check if its oid is in the LFS and if so associate it with the base repository. (I probably should only add it to the base repository if that oid is actually associated with the head repository for that possible security reason above.)
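As a rough sketch of that merge rule, including the stricter head-repository check from the parenthetical: the function `oids_to_associate` and its parameters are hypothetical and work on plain sets of oids, so this is not the code from #7082.

```python
def oids_to_associate(candidate_oids, stored_oids, head_repo_oids, base_repo_oids):
    """Pick the oids to link to the base repo when merging head into base.

    candidate_oids: oids taken from blobs in the merge that look like pointers.
    stored_oids: oids whose content exists in the LFS store.
    head_repo_oids / base_repo_oids: oids already linked to each repository.
    """
    return {
        oid
        for oid in candidate_oids
        if oid in stored_oids          # the object really exists in the LFS
        and oid in head_repo_oids      # stricter check: head must already own it
        and oid not in base_repo_oids  # nothing to do if base is already linked
    }


to_link = oids_to_associate(
    candidate_oids={"x", "y", "z"},
    stored_oids={"x", "y", "z"},
    head_repo_oids={"x", "y"},
    base_repo_oids={"y"},
)
```

Here only `x` would be linked: `y` is already attached to the base repository, and `z` exists in the store but is not attached to the head repository, so linking it would hand the base repository an object that the PR author never demonstrably had.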

When we display files in the UI we tend to just check if the blob looks like a pointer and then, if the oid is associated with the repository, assume it's meant to be an LFS object.

It's only during uploads to repositories that we actually pay attention to .gitattributes as that's the only possible hint we have that an object should be in the LFS.

It's not simple at all. The spec for LFS is so extensible that you just don't know why an object has been placed in the LFS.

There is one final thing that might be useful - find all things in the store that are not associated with a repo - then you have to walk all the repos and try to find out if they could match a repo.

@stale

stale bot commented Jul 31, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 weeks. Thank you for your contributions.

@stale stale bot added the issue/stale label Jul 31, 2019
@yacoob yacoob closed this as completed Jul 31, 2019
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
@zeripath zeripath reopened this Nov 12, 2022
@stale stale bot removed the issue/stale label Nov 12, 2022
@lunny lunny added type/feature Completely new functionality. Can only be merged if feature freeze is not active. type/proposal The new feature has not been accepted yet but needs to be discussed first. and removed type/question Issue needs no code to be fixed, only a description on how to fix it yourself. labels Dec 5, 2022
zeripath added a commit to zeripath/gitea that referenced this issue Jan 9, 2023
This PR adds a task to the cron service to allow garbage collection of
LFS meta objects. As repositories may have a large number of
LFSMetaObjects, an updated column is added to this table and it is used
to perform a generational GC to attempt to reduce the amount of work.
(There may need to be a bit more work here but this is probably enough
for the moment.)

Fix go-gitea#7045

Signed-off-by: Andrew Thornton <art27@cantab.net>
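
The commit message above mentions a generational GC driven by an updated column. The following is a minimal sketch of that general idea, not the implementation in the referenced PR: `MetaObject`, `gc_pass`, and the `recheck_after` and `delete_after` thresholds are hypothetical, and the expensive "is this oid still referenced by the repository?" check is passed in as a callback.

```python
from dataclasses import dataclass


@dataclass
class MetaObject:
    repo_id: int
    oid: str
    updated: int  # unix time when the object was last confirmed as referenced


def gc_pass(meta_objects, is_referenced, now, recheck_after, delete_after):
    """Run one GC pass and return the meta objects that survive it."""
    survivors = []
    for meta in meta_objects:
        age = now - meta.updated
        if age < recheck_after:
            survivors.append(meta)  # recently confirmed: skip the expensive check
        elif is_referenced(meta):
            meta.updated = now      # still in use: move it back to the young generation
            survivors.append(meta)
        elif age < delete_after:
            survivors.append(meta)  # unreferenced, but not old enough to delete yet
        # otherwise: unreferenced and old, so it is dropped
    return survivors


objs = [
    MetaObject(1, "a", 95),
    MetaObject(1, "b", 50),
    MetaObject(1, "c", 50),
    MetaObject(1, "d", 0),
]
kept = gc_pass(objs, lambda m: m.oid == "b", now=100, recheck_after=10, delete_after=80)
kept_oids = [m.oid for m in kept]
```

The point of the timestamp is that objects confirmed recently are skipped entirely, so each pass only walks the repository for the older generation.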
@go-gitea go-gitea unlocked this conversation Jan 12, 2023
jolheiser pushed a commit that referenced this issue Jan 16, 2023
This PR adds a task to the cron service to allow garbage collection of
LFS meta objects. As repositories may have a large number of
LFSMetaObjects, an updated column is added to this table and it is used
to perform a generational GC to attempt to reduce the amount of work.
(There may need to be a bit more work here but this is probably enough
for the moment.)

Fix #7045

Signed-off-by: Andrew Thornton <art27@cantab.net>
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023