-
Notifications
You must be signed in to change notification settings - Fork 667
[P0] Fix #3513 ([MAJOR] content digest not found
)
#4062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Green once. 🤔 |
@@ -127,7 +127,7 @@ func Build(ctx context.Context, client *containerd.Client, options types.Builder | |||
if _, err := imageService.Create(ctx, image); err != nil { | |||
// if already exists; skip. | |||
if errors.Is(err, errdefs.ErrAlreadyExists) { | |||
if err = imageService.Delete(ctx, targetRef); err != nil { | |||
if err = imageService.Delete(ctx, targetRef, images.SynchronousDelete()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may explain some of the "internal: not found" errrors we sometimes see when building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this behavior is not intuitive at all. I expect that delete to be sync
by default and the async is activated by parameter, Good catch
Green twice. 😓 |
[MAJOR] content digest not found
)
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
@@ -127,7 +127,7 @@ func Build(ctx context.Context, client *containerd.Client, options types.Builder | |||
if _, err := imageService.Create(ctx, image); err != nil { | |||
// if already exists; skip. | |||
if errors.Is(err, errdefs.ErrAlreadyExists) { | |||
if err = imageService.Delete(ctx, targetRef); err != nil { | |||
if err = imageService.Delete(ctx, targetRef, images.SynchronousDelete()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this behavior is not intuitive at all. I expect that delete to be sync
by default and the async is activated by parameter, Good catch
|
||
img.Name = dstRef | ||
|
||
_ = imageService.Delete(ctx, img.Name, images.SynchronousDelete()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to delete/re-create destRef
if it already exists ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not seem like we do need to delete indeed (note that I am not 100% familiar with it yet).
Indeed, my containerd patch does away with the delete (which is the source of the problem), and instead tries to update
, then if it fails tries to create
.
This patch here is opting for the least amount of change, but of course happy to clean-up a bit more if we want.
lmk your preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think this is the end of it (though I am convinced this patch will significantly enhance the situation).
I think in a follow-up PR we should thoroughly review our codebase and:
- make sure we use leases where need be (GC might still happen and trigger similar symptoms)
- evaluate why we are calling
delete
and what happens if we do not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to unblock the situation. But in the short and medium term, I don't want to move the containerd logic to nerdctl.
// Something seems wrong in converter.Convert. | ||
// When dstRef != srcRef, convert will first forcefully delete dstRef, | ||
// *asynchronously*, then create the image. | ||
// This seems to cause a race conditions, and the deletion may kick in after the creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lease is not maintained during the creation ? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No clue.
My gut intuition is that this thing faceplants when the old ref is actually exactly the same image as new ref.
Note that patching containerd to also use a sync delete does NOT fix the issue. 🤔
Anyhow, for me the bottom-line really is:
- do not use leases + delete - clealy buggy
- do not use async delete - counter-intuitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm to fix these problems that have been going on for a long time but the real fix should be done on the containerd side
Purely for the I will eventually get to the true bottom of this pit and figure out if there is indeed a bug / race in the deletion mechanism (which I do believe there is), but I need time and motivation... |
Suggesting we merge this soon. I would like to get as much CI watching time with that patch merged-in before it gets into the general audience with the next patch release. |
I would have preferred to see @AkihiroSuda 's opinion on this, but he is certainly busy with KubeCon. Let's merge this and roll back later if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the real fix should be done on the daemon side, but to unblock multiple issues we have on the CLI, let's merge the fix and see what the patch on the containerd side will bring.
Thanks, LGTM |
Current status
Discussion and patch upstream: containerd/containerd#11628
TL;DR here: this PR here on nerdctl does seem to fix our issues (many runs of the CI with no issue), because it entirely bypasses
converter.Convert
Delete
logic which seems to trigger some underlying containerd issue related to lease and deletion (<- my opinion, unverified).My patch upstream does ALSO fix our issue, with a clear reproducer, although it probably just avoids the underlying containerd problem.
Original post below
It looks like containerd
imagestore.Delete
is asynchronous by default.If you want a synchronous delete, you have to pass an additional opt.
Of course, this seems to be a problem when you:
This seem to be the case (actually, it is worse) for https://github.com/containerd/containerd/blob/af013606262f9df1de4cc80688725a15a375fb3a/core/images/converter/converter.go#L120-L121
... where convert.Convert just forcefully delete, whether or not the image existed, async...
This of course is also something we are doing internally.
This PR addresses these issues:
At this point, this is still a hunch - but for the first time I feel confident that we may be reaching the bottom of it, as this explanation closely matches the soul-destroying symptoms we have witnessed in eg: #3509 (comment)
and more generally, with anything touching containerd
Convert
(see #3513).I will close an open this PR many times and confirm that the infamous #3513 is finally gone, but I would like the expert opinion of @lingdie on this - if you have time for a quick review.
Upstream PR: containerd/containerd#11628
Note that this PR here does not try to do more than that - there MAY still be places where we would need to
EnsureContent
.EDIT: looks like containerd maintainer disagree with this, so, maybe I am misreading this.
Anyhow, I will keep closing / opening and run the CI - we are currently at: