-
Notifications
You must be signed in to change notification settings - Fork 13.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flink 34569][e2e] fail fast if AWS cli container fails to start #24491
[Flink 34569][e2e] fail fast if AWS cli container fails to start #24491
Conversation
@@ -29,12 +29,18 @@ | |||
# AWSCLI_CONTAINER_ID | |||
################################### | |||
function aws_cli_start() { | |||
export AWSCLI_CONTAINER_ID=$(docker run -d \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on it! I just left one comment. PTAL
@@ -58,7 +64,11 @@ function aws_cli_stop() { | |||
if [[ $AWSCLI_CONTAINER_ID ]]; then | |||
aws_cli_stop | |||
fi | |||
aws_cli_start | |||
aws_cli_start || aws_cli_start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will simple retry like this let the script be executed multiple times without proper cleanup of previous container, given the aws_cli_start is not Idempotent, afaiu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that docker run -d ...
with detach did not create a container if the exit code is non-zero. So failed aws_cli_start
shouldn't leave something to be cleaned. But I can't find docs stating that.
Happy to remove the retry, getting the test to fail faster with a more obvious cause is an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added in a failsafe to kill/remove if CONTAINER_ID
is non-empty after the docker run fails.
Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com>
Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com>
Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com>
Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com>
119cd86
to
4b233a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for working on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
…che#24491) * [FLINK-34569][e2e] Fail fast if aws cli container fails to run Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This lead to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Add naive retry when creating aws cli container Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Remove jq containers after user Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com> * [FLINK-34569][e2e] Clean up after failed awscli container run Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com> --------- Signed-off-by: Robert Young <robeyoun@redhat.com>
What is the purpose of the change
This pull request aims to make end-to-end test scripts that source
common_s3_operations.sh
fail fast if the aws cli container fails to start. It also adds a single naive retry aiming to recover from a transient network failure.FLINK-34569 describes an issue where an end-to-end test run took 15 minutes to fail after the aws cli container failed to start. From the test logs:
This failure isn't handled and so later we were stuck in a loop trying to docker exec commands like
docker exec -t "" command
.To test it locally I've been provoking docker run failures by changing the image name to something non-existent.
Brief change log
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
I verified that it fails fast by modifying the awscli image to have a non-existant name, to provoke a
docker run
failure, causing it to fail like:Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation