Skip to content

[P0] Fix #3513 ([MAJOR] content digest not found) #4062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 6, 2025

Conversation

apostasie
Copy link
Contributor

@apostasie apostasie commented Mar 31, 2025

Current status

Discussion and patch upstream: containerd/containerd#11628

TL;DR here: this PR here on nerdctl does seem to fix our issues (many runs of the CI with no issue), because it entirely bypasses converter.Convert Delete logic which seems to trigger some underlying containerd issue related to lease and deletion (<- my opinion, unverified).

My patch upstream does ALSO fix our issue, with a clear reproducer, although it probably just avoids the underlying containerd problem.


Original post below

It looks like containerd imagestore.Delete is asynchronous by default.

If you want a synchronous delete, you have to pass an additional opt.

Of course, this seems to be a problem when you:

  • check if the image exist
  • delete it (async) if it does
  • create a new version of it

This seem to be the case (actually, it is worse) for https://github.com/containerd/containerd/blob/af013606262f9df1de4cc80688725a15a375fb3a/core/images/converter/converter.go#L120-L121
... where convert.Convert just forcefully delete, whether or not the image existed, async...

This of course is also something we are doing internally.

This PR addresses these issues:

  • ensure we delete SYNCHRONOUSLY where it matters
  • honor the async option when there is one
  • wrap / workaround containerd (* maybe) faulty implementation of convert

At this point, this is still a hunch - but for the first time I feel confident that we may be reaching the bottom of it, as this explanation closely matches the soul-destroying symptoms we have witnessed in eg: #3509 (comment)

and more generally, with anything touching containerd Convert (see #3513).

I will close an open this PR many times and confirm that the infamous #3513 is finally gone, but I would like the expert opinion of @lingdie on this - if you have time for a quick review.

Upstream PR: containerd/containerd#11628

Note that this PR here does not try to do more than that - there MAY still be places where we would need to EnsureContent.

EDIT: looks like containerd maintainer disagree with this, so, maybe I am misreading this.
Anyhow, I will keep closing / opening and run the CI - we are currently at:

@apostasie apostasie changed the title [WIP] Fix content digest not found for tmp reduced platform [DEBUGGING] [WIP] Fix content digest not found for tmp reduced platform Mar 31, 2025
@apostasie apostasie changed the title [DEBUGGING] [WIP] Fix content digest not found for tmp reduced platform [DEBUGGING] Tentatively fix #3513 Mar 31, 2025
@apostasie apostasie marked this pull request as draft March 31, 2025 19:37
@apostasie
Copy link
Contributor Author

apostasie commented Mar 31, 2025

Green once. 🤔

@apostasie apostasie closed this Mar 31, 2025
@apostasie apostasie reopened this Mar 31, 2025
@@ -127,7 +127,7 @@ func Build(ctx context.Context, client *containerd.Client, options types.Builder
if _, err := imageService.Create(ctx, image); err != nil {
// if already exists; skip.
if errors.Is(err, errdefs.ErrAlreadyExists) {
if err = imageService.Delete(ctx, targetRef); err != nil {
if err = imageService.Delete(ctx, targetRef, images.SynchronousDelete()); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may explain some of the "internal: not found" errrors we sometimes see when building.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this behavior is not intuitive at all. I expect that delete to be sync by default and the async is activated by parameter, Good catch

@apostasie apostasie changed the title [DEBUGGING] Tentatively fix #3513 Tentatively fix #3513 Mar 31, 2025
@apostasie apostasie changed the title Tentatively fix #3513 Tentatively fix (parts of?) #3513 Mar 31, 2025
@apostasie apostasie marked this pull request as ready for review March 31, 2025 21:09
@apostasie
Copy link
Contributor Author

Green twice. 😓

@apostasie apostasie closed this Mar 31, 2025
@apostasie apostasie reopened this Mar 31, 2025
@apostasie apostasie closed this Mar 31, 2025
@apostasie apostasie reopened this Mar 31, 2025
@apostasie apostasie closed this Mar 31, 2025
@apostasie apostasie reopened this Mar 31, 2025
@apostasie apostasie closed this Mar 31, 2025
@apostasie apostasie reopened this Mar 31, 2025
@apostasie apostasie closed this Apr 1, 2025
@apostasie apostasie reopened this Apr 1, 2025
@apostasie apostasie closed this Apr 1, 2025
@apostasie apostasie reopened this Apr 1, 2025
@apostasie apostasie changed the title Tentatively fix (parts of?) #3513 Fix #3513 Apr 1, 2025
@apostasie apostasie changed the title Fix #3513 [PRIORITY] Fix #3513 Apr 1, 2025
@apostasie apostasie changed the title [PRIORITY] Fix #3513 [P0] Fix #3513 Apr 1, 2025
@apostasie apostasie closed this Apr 1, 2025
@apostasie apostasie reopened this Apr 1, 2025
@AkihiroSuda AkihiroSuda changed the title [P0] Fix #3513 [P0] Fix #3513 ([MAJOR] content digest not found) Apr 1, 2025
@AkihiroSuda AkihiroSuda added this to the v2.0.5 milestone Apr 1, 2025
@AkihiroSuda AkihiroSuda linked an issue Apr 1, 2025 that may be closed by this pull request
@AkihiroSuda AkihiroSuda requested review from a team April 1, 2025 15:44
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
@@ -127,7 +127,7 @@ func Build(ctx context.Context, client *containerd.Client, options types.Builder
if _, err := imageService.Create(ctx, image); err != nil {
// if already exists; skip.
if errors.Is(err, errdefs.ErrAlreadyExists) {
if err = imageService.Delete(ctx, targetRef); err != nil {
if err = imageService.Delete(ctx, targetRef, images.SynchronousDelete()); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this behavior is not intuitive at all. I expect that delete to be sync by default and the async is activated by parameter, Good catch


img.Name = dstRef

_ = imageService.Delete(ctx, img.Name, images.SynchronousDelete())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to delete/re-create destRef if it already exists ?

Copy link
Contributor Author

@apostasie apostasie Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem like we do need to delete indeed (note that I am not 100% familiar with it yet).

Indeed, my containerd patch does away with the delete (which is the source of the problem), and instead tries to update, then if it fails tries to create.

This patch here is opting for the least amount of change, but of course happy to clean-up a bit more if we want.

lmk your preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fahedouch

Actually, I don't think this is the end of it (though I am convinced this patch will significantly enhance the situation).

I think in a follow-up PR we should thoroughly review our codebase and:

  • make sure we use leases where need be (GC might still happen and trigger similar symptoms)
  • evaluate why we are calling delete and what happens if we do not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to unblock the situation. But in the short and medium term, I don't want to move the containerd logic to nerdctl.

// Something seems wrong in converter.Convert.
// When dstRef != srcRef, convert will first forcefully delete dstRef,
// *asynchronously*, then create the image.
// This seems to cause a race conditions, and the deletion may kick in after the creation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lease is not maintained during the creation ? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clue.

My gut intuition is that this thing faceplants when the old ref is actually exactly the same image as new ref.

Note that patching containerd to also use a sync delete does NOT fix the issue. 🤔

Anyhow, for me the bottom-line really is:

  • do not use leases + delete - clealy buggy
  • do not use async delete - counter-intuitive

Copy link
Member

@fahedouch fahedouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm to fix these problems that have been going on for a long time but the real fix should be done on the containerd side

@apostasie
Copy link
Contributor Author

lgtm to fix these problems that have been going on for a long time but the real fix should be done on the containerd side

Purely for the converter.Convert part, proposal is there: containerd/containerd#11628 if you feel like it is right.

I will eventually get to the true bottom of this pit and figure out if there is indeed a bug / race in the deletion mechanism (which I do believe there is), but I need time and motivation...

@apostasie
Copy link
Contributor Author

Suggesting we merge this soon.

I would like to get as much CI watching time with that patch merged-in before it gets into the general audience with the next patch release.

@fahedouch
Copy link
Member

fahedouch commented Apr 6, 2025

Suggesting we merge this soon.

I would like to get as much CI watching time with that patch merged-in before it gets into the general audience with the next patch release.

I would have preferred to see @AkihiroSuda 's opinion on this, but he is certainly busy with KubeCon. Let's merge this and roll back later if needed.

Copy link
Member

@fahedouch fahedouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the real fix should be done on the daemon side, but to unblock multiple issues we have on the CLI, let's merge the fix and see what the patch on the containerd side will bring.

@fahedouch fahedouch merged commit a43a2e0 into containerd:main Apr 6, 2025
30 checks passed
@AkihiroSuda
Copy link
Member

Thanks, LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MAJOR] content digest not found
3 participants