
GC should be more resilient for flaky backends #15469

Closed
Vad1mo opened this issue Aug 23, 2021 · 5 comments · Fixed by #16094
Assignees: wy65701436
Labels: area/gc, kind/spike (technical investigation work before providing a concrete plan or estimation for a feature)

Comments

Vad1mo (Member) commented Aug 23, 2021

Object storage backends such as S3, Swift, and their various implementations can behave inconsistently and even fail to deliver data, especially under load or with many objects.

Imagine the GC running over 5 TB of data when a timeout or some other issue occurs somewhere along the way: the whole GC process just stops.

Here is such an example:

2021-08-23T12:36:33Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:319]: delete blob record from database: 250242, sha256:077dd2e132e2c571aaf460945df023c426c54bdae705218245842caa2be5787f
2021-08-23T12:36:33Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: a9/bur_mks, application/vnd.docker.distribution.manifest.v2+json, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302
2021-08-23T13:06:34Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, a9/bur_mks, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302, retry timeout
2021-08-23T13:06:34Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: a9/bur_mks, sha256:482fd0cc27d97e226581232b82e761d27e6aba5b943d82b7d5568456f7454302: retry timeout

My recommendation is to make GC more resilient so that it can carry on even if there are errors with individual repositories.
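
For illustration, here is a minimal sketch of that behavior in Go (not Harbor's actual GC code; the `artifact` type and `deleteManifest` helper are hypothetical stand-ins): record per-artifact failures, keep sweeping, and report a summary at the end instead of aborting on the first error.

```go
package main

import (
	"fmt"
	"log"
)

type artifact struct {
	Repository string
	Digest     string
}

// deleteManifest stands in for the registry v2 manifest deletion call.
func deleteManifest(a artifact) error {
	// DELETE /v2/<repository>/manifests/<digest> would happen here.
	return nil
}

// sweep deletes every candidate artifact, skipping over individual failures
// instead of failing the whole GC job.
func sweep(candidates []artifact) error {
	var failed []artifact
	for _, a := range candidates {
		if err := deleteManifest(a); err != nil {
			log.Printf("failed to delete %s@%s: %v, continuing", a.Repository, a.Digest, err)
			failed = append(failed, a)
			continue
		}
	}
	if len(failed) > 0 {
		// Surface the failures without throwing away the work already done.
		return fmt.Errorf("sweep finished with %d failed deletions", len(failed))
	}
	return nil
}

func main() {
	candidates := []artifact{{Repository: "a9/bur_mks", Digest: "sha256:..."}}
	if err := sweep(candidates); err != nil {
		log.Println(err)
	}
}
```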

wy65701436 self-assigned this Aug 24, 2021
wy65701436 (Contributor) commented:

By default, the retry logic will attempt to delete a blob or manifest for one minute. Even if we let the GC job continue executing, it may still run into the same deletion failures.
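
To make the described behavior concrete, here is a minimal sketch (assuming a hypothetical `deleteBlob` call, not Harbor's implementation) of a retry loop bounded by a one-minute budget; once the budget is exhausted, the error propagates and, in the current design, fails the whole GC job.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// deleteBlob stands in for the real blob/manifest deletion call.
func deleteBlob(ctx context.Context, digest string) error {
	// The actual backend request would happen here.
	return nil
}

// deleteWithRetry keeps retrying a deletion with exponential backoff until
// it succeeds or the one-minute budget is spent.
func deleteWithRetry(digest string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	backoff := time.Second
	for {
		err := deleteBlob(ctx, digest)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Budget exhausted: in the current design this error fails
			// the whole GC job, not just this one deletion.
			return fmt.Errorf("retry timeout deleting %s: %v", digest, err)
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}

func main() {
	if err := deleteWithRetry("sha256:..."); err != nil {
		log.Println(err)
	}
}
```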

Vad1mo (Member, Author) commented Aug 24, 2021

We are now running 2.3.1. The long-term problem is that when there is a permanent problem with a blob or manifest, the registry keeps on growing.

wy65701436 (Contributor) commented:

I understand your problem; it can turn into a GC backlog.

However, IMO, since GC is a high-risk task within the system and it currently runs well, we should be cautious about updating the code.

wy65701436 added the kind/spike label Aug 26, 2021
dkulchinsky (Contributor) commented Nov 3, 2021

Just chiming in here: we are having several persistent issues with GC:

  1. GC fails when a manifest is not found; this is currently the biggest issue in our production instance: GC fails when manifest not found #15822
  2. GC fails due to backend storage issues like the one described here. We use GCS, and it's quite common to get 500s from it; GC should retry those, but if it is still unsuccessful it should continue and clean up the rest of the blobs/manifests. Currently it just fails completely, and we end up with thousands of artifacts we can't clean up (a rough sketch of this handling follows at the end of this comment).
  3. GC becomes extremely slow once repositories reach 4-5K tags (each manifest delete takes ~2 minutes) and completely unusable beyond ~15K tags, where it just times out after 20 minutes: [GC performance] The performance of v2 manifest deletion is not good in S3 environment #12948
  4. GC fails with "invalid checksum digest format" from registry when deleting manifests #15970

I think these issues deserve more immediate attention. We are reaching a point where GC is simply broken due to the combination of all of them, and I don't think we are the only ones who will hit this.
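
As a rough illustration of the handling suggested in points 1 and 2, here is a sketch in Go (assumed, not Harbor's implementation; `statusError` and `handleDeletion` are hypothetical) that retries transient 5xx backend errors a few times, treats a missing manifest as already deleted, and otherwise records the failure so the sweep can move on.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
)

// statusError is a hypothetical error type carrying the backend HTTP status.
type statusError struct{ Code int }

func (e *statusError) Error() string { return fmt.Sprintf("backend returned %d", e.Code) }

func isTransient(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.Code >= http.StatusInternalServerError
}

func isNotFound(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.Code == http.StatusNotFound
}

// handleDeletion retries transient failures, treats "not found" as already
// deleted, and logs permanent failures instead of aborting the sweep.
func handleDeletion(repo, digest string, del func() error) {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := del()
		switch {
		case err == nil:
			return
		case isNotFound(err):
			log.Printf("%s@%s not found, treating as already deleted", repo, digest)
			return
		case isTransient(err) && attempt < maxAttempts:
			log.Printf("transient error deleting %s@%s (attempt %d): %v", repo, digest, attempt, err)
			continue
		default:
			log.Printf("giving up on %s@%s: %v", repo, digest, err)
			return
		}
	}
}

func main() {
	// Example: a GCS-style 500 that eventually succeeds is retried,
	// while a 404 is skipped immediately.
	calls := 0
	handleDeletion("library/app", "sha256:...", func() error {
		calls++
		if calls < 2 {
			return &statusError{Code: http.StatusInternalServerError}
		}
		return nil
	})
	handleDeletion("library/app", "sha256:...", func() error {
		return &statusError{Code: http.StatusNotFound}
	})
}
```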

sidewinder12s commented Nov 8, 2021

This is a massive issue for us. We have hundreds of thousands of tags and 200+ TB of data to delete, and because the GC task fails on retries and then must rebuild the proposed deletion list on every run, we cannot get through GC.

2021-11-05T22:27:09Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: library/tag/tag, application/vnd.docker.distribution.manifest.v2+json, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0
2021-11-05T22:57:09Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, library/tag/tag, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0, retry timeout
2021-11-05T22:57:09Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: library/tag/tag, sha256:f12b18bb747e33e94aae7b9bc7883979217eedcf0414e7e59ea264a8535844f0: retry timeout

This appears to be #12948
