-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
workround for podman readlink error #1905
Conversation
When we detect that podman failed with the storage corruption warning and readlink no such file error, we will remove the image and set the command result to unavailable error to trigger a retry.
95ca7b5
to
9cd41b6
Compare
err = removeImage(ctx, c.image) | ||
if err != nil { | ||
log.Warningf("Failed to remove corrupted image: %s", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can inline:
if err := removeImage(ctx, c.image); err != nil {
log.Warningf("Failed to remove corrupted image: %s", err)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -141,6 +143,13 @@ func (c *podmanCommandContainer) Run(ctx context.Context, command *repb.Command, | |||
podmanRunArgs = append(podmanRunArgs, c.image) | |||
podmanRunArgs = append(podmanRunArgs, command.Arguments...) | |||
result = runPodman(ctx, "run", nil, nil, podmanRunArgs...) | |||
if isStorageCorrupted(result) { | |||
result.Error = status.UnavailableError("podman run failed because of storage corruption") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about just "a storage corruption occurred"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if isStorageCorrupted(createResult) { | ||
createResult.Error = status.UnavailableError("podman create failed because of storage corruption") | ||
err = removeImage(ctx, c.image) | ||
if err != nil { | ||
log.Warningf("Failed to remove corrupted image: %s", err) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you move this to a new helper function you can use in both places?
func maybeCleanupCorruptedImages()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -164,6 +173,13 @@ func (c *podmanCommandContainer) Create(ctx context.Context, workDir string) err | |||
if err = createResult.Error; err != nil { | |||
return status.UnavailableErrorf("failed to create container: %s", err) | |||
} | |||
if isStorageCorrupted(createResult) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to happen before the check on line 173?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit
if !storageErrorRegex.MatchString(string(result.Stderr)) { | ||
return nil | ||
} | ||
result.Error = status.UnavailableError("a storage corruption occurred for podman") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set result.ExitCode = commandutil.NoExitCode
here -- the convention for CommandResult is that ExitCode>=0
if and only if Error==nil
, since an Error
means that the process itself did not run normally and return a valid exit code in the range 0-255. In this case podman is the thing returning the exit code, and podman is sort of an implementation detail that we shouldn't expose in the Run()/Exec() API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for the explanation!
When we detect that podman failed with the storage corruption warning
and readlink no such file error, we will remove the image and set the
command result to unavailable error to trigger a retry.