
GC should be more resilient for flaky backends #15469

Closed
Vad1mo opened this issue Aug 23, 2021 · 5 comments · Fixed by #16094
Assignees: wy65701436
Labels: area/gc, kind/spike (technical investigation work before providing a concrete plan or estimation for a feature)

Comments

Vad1mo (Member) commented Aug 23, 2021

Object storage backends such as S3, Swift, and their various implementations can behave inconsistently and even fail to deliver data, especially under load or with many objects.

Imagine the GC running over 5 TB of data when a timeout or some other issue occurs somewhere along the way: the whole GC process just stops.

Here is such an example:

2021-08-23T12:36:33Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:319]: delete blob record from database: 250242, sha256:077dd2e132e2c571aaf460945df023c426c54bdae705218245842caa2be5787f
2021-08-23T12:36:33Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: a9/bur_mks, application/vnd.docker.distribution.manifest.v2+json, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302
2021-08-23T13:06:34Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, a9/bur_mks, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302, retry timeout
2021-08-23T13:06:34Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: a9/bur_mks, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302: retry timeout

My recommendation is to make GC more resilient so that it can carry on even if there are errors with individual repositories.
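
For illustration, here is a minimal sketch of that behavior in Go (not Harbor's actual GC code; the `artifact` type and `deleteManifest` helper are hypothetical stand-ins): record per-artifact failures, keep sweeping, and report a summary at the end instead of aborting on the first error.

```go
package main

import (
	"fmt"
	"log"
)

type artifact struct {
	Repository string
	Digest     string
}

// deleteManifest stands in for the registry v2 manifest deletion call.
func deleteManifest(a artifact) error {
	// DELETE /v2/<repository>/manifests/<digest> would happen here.
	return nil
}

// sweep deletes every candidate artifact, skipping over individual failures
// instead of failing the whole GC job.
func sweep(candidates []artifact) error {
	var failed []artifact
	for _, a := range candidates {
		if err := deleteManifest(a); err != nil {
			log.Printf("failed to delete %s@%s: %v, continuing", a.Repository, a.Digest, err)
			failed = append(failed, a)
			continue
		}
	}
	if len(failed) > 0 {
		// Surface the failures without throwing away the work already done.
		return fmt.Errorf("sweep finished with %d failed deletions", len(failed))
	}
	return nil
}

func main() {
	candidates := []artifact{{Repository: "a9/bur_mks", Digest: "sha256:..."}}
	if err := sweep(candidates); err != nil {
		log.Println(err)
	}
}
```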

wy65701436 self-assigned this Aug 24, 2021
wy65701436 (Contributor) commented:

By default, the retry logic will attempt to delete a blob or manifest for one minute. Even if we let the GC job continue executing, it may still run into the same deletion failures.
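
To make the described behavior concrete, here is a minimal sketch (assuming a hypothetical `deleteBlob` call, not Harbor's implementation) of a retry loop bounded by a one-minute budget; once the budget is exhausted, the error propagates and, in the current design, fails the whole GC job.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// deleteBlob stands in for the real blob/manifest deletion call.
func deleteBlob(ctx context.Context, digest string) error {
	// The actual backend request would happen here.
	return nil
}

// deleteWithRetry keeps retrying a deletion with exponential backoff until
// it succeeds or the one-minute budget is spent.
func deleteWithRetry(digest string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	backoff := time.Second
	for {
		err := deleteBlob(ctx, digest)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Budget exhausted: in the current design this error fails
			// the whole GC job, not just this one deletion.
			return fmt.Errorf("retry timeout deleting %s: %v", digest, err)
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}

func main() {
	if err := deleteWithRetry("sha256:..."); err != nil {
		log.Println(err)
	}
}
```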

Vad1mo (Member, Author) commented Aug 24, 2021

We are now running 2.3.1. The long-term problem is that when there is a permanent problem with a blob or manifest, the registry keeps on growing.

wy65701436 (Contributor) commented:

I understand your problem; it can turn into a GC backlog.

However, IMO, since GC is a high-risk task within the system and it currently runs well, we should be cautious about updating the code.

wy65701436 added the kind/spike label Aug 26, 2021
dkulchinsky (Contributor) commented Nov 3, 2021

Just chiming in here: we are having several persistent issues with GC:

  1. GC fails when a manifest is not found; this is currently the biggest issue in our production instance: GC fails when manifest not found #15822
  2. GC fails due to backend storage issues like the one described here. We use GCS, and it's quite common to get 500s from it; GC should retry those, but if it is still unsuccessful it should continue and clean up the rest of the blobs/manifests. Currently it just fails completely, and we end up with thousands of artifacts we can't clean up (a rough sketch of this handling follows at the end of this comment).
  3. GC becomes extremely slow once repositories reach 4-5K tags (each manifest delete takes ~2 minutes) and completely unusable beyond ~15K tags, where it just times out after 20 minutes: [GC performance] The performance of v2 manifest deletion is not good in S3 environment #12948
  4. GC fails with "invalid checksum digest format" from registry when deleting manifests #15970

I think these issues deserve more immediate attention. We are reaching a point where GC is simply broken due to the combination of all of them, and I don't think we are the only ones who will hit this.
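
As a rough illustration of the handling suggested in points 1 and 2, here is a sketch in Go (assumed, not Harbor's implementation; `statusError` and `handleDeletion` are hypothetical) that retries transient 5xx backend errors a few times, treats a missing manifest as already deleted, and otherwise records the failure so the sweep can move on.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
)

// statusError is a hypothetical error type carrying the backend HTTP status.
type statusError struct{ Code int }

func (e *statusError) Error() string { return fmt.Sprintf("backend returned %d", e.Code) }

func isTransient(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.Code >= http.StatusInternalServerError
}

func isNotFound(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.Code == http.StatusNotFound
}

// handleDeletion retries transient failures, treats "not found" as already
// deleted, and logs permanent failures instead of aborting the sweep.
func handleDeletion(repo, digest string, del func() error) {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := del()
		switch {
		case err == nil:
			return
		case isNotFound(err):
			log.Printf("%s@%s not found, treating as already deleted", repo, digest)
			return
		case isTransient(err) && attempt < maxAttempts:
			log.Printf("transient error deleting %s@%s (attempt %d): %v", repo, digest, attempt, err)
			continue
		default:
			log.Printf("giving up on %s@%s: %v", repo, digest, err)
			return
		}
	}
}

func main() {
	// Example: a GCS-style 500 that eventually succeeds is retried,
	// while a 404 is skipped immediately.
	calls := 0
	handleDeletion("library/app", "sha256:...", func() error {
		calls++
		if calls < 2 {
			return &statusError{Code: http.StatusInternalServerError}
		}
		return nil
	})
	handleDeletion("library/app", "sha256:...", func() error {
		return &statusError{Code: http.StatusNotFound}
	})
}
```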

sidewinder12s commented Nov 8, 2021

This is a massive issue for us. We have hundreds of thousands of tags and 200+ TB of data to delete, and because the GC task fails on retries and then must rebuild the proposed deletion list on every run, we cannot get through GC.

2021-11-05T22:27:09Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: library/tag/tag, application/vnd.docker.distribution.manifest.v2+json, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0
2021-11-05T22:57:09Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, library/tag/tag, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0, retry timeout
2021-11-05T22:57:09Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: library/tag/tag, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0: retry timeout

This appears to be #12948
