[Flink 34569][e2e] fail fast if AWS cli container fails to start #24491

robobario · 2024-03-13T23:07:47Z

What is the purpose of the change

This pull request aims to make end-to-end test scripts that source common_s3_operations.sh fail fast if the aws cli container fails to start. It also adds a single naive retry aiming to recover from a transient network failure.

FLINK-34569 describes an issue where an end-to-end test run took 15 minutes to fail after the aws cli container failed to start. From the test logs:

2024-03-02T04:10:55.5496990Z Unable to find image 'banst/awscli:latest' locally
2024-03-02T04:10:56.3857380Z docker: Error response from daemon: Head "https://registry-1.docker.io/v2/banst/awscli/manifests/latest": read tcp 10.1.0.97:33016->54.236.113.205:443: read: connection reset by peer.
2024-03-02T04:10:56.3857877Z See 'docker run --help'.
2024-03-02T04:10:56.4586492Z Error: No such object:

This failure was not handled, so AWSCLI_CONTAINER_ID was left empty and the script later got stuck in a loop running commands like docker exec -t "" command.

To test it locally I've been provoking docker run failures by changing the image name to something non-existent.

Brief change log

Fail fast if aws cli container fails to run
Add naive retry when creating aws cli container
Add --rm to jq docker run commands to remove them on exit
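
The fail-fast behaviour in the first item can be sketched roughly as follows. This is an illustration, not the code from the PR: fake_docker_run, start_container, CONTAINER_ID and FAKE_RUN_FAILS are hypothetical names, and the stub stands in for the real docker run -d call so the snippet runs without Docker.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `docker run -d ...`; not part of the Flink scripts.
# Set FAKE_RUN_FAILS=1 to simulate an image pull failure.
fake_docker_run() {
  if [ "${FAKE_RUN_FAILS:-0}" = "1" ]; then
    echo "docker: simulated failure" >&2
    return 1
  fi
  echo "abc123"
}

start_container() {
  # Assign first and export later, so the status of the command is not masked.
  if ! CONTAINER_ID=$(fake_docker_run) || [ -z "$CONTAINER_ID" ]; then
    echo "running container failed" >&2
    # return instead of exit, so the caller can decide whether to retry
    return 1
  fi
  export CONTAINER_ID
}

if start_container; then
  echo "started ${CONTAINER_ID}"
fi
```

Returning a non-zero status instead of exiting leaves the decision to the caller, which is what makes a simple retry possible.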

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

I verified that it fails fast by changing the awscli image to a non-existent name to provoke a docker run failure, which makes the test fail like this:

==============================================================================
Running 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions'
==============================================================================
TEST_DATA_DIR: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53909550201
Flink dist directory: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-dist/target/flink-1.20-SNAPSHOT-bin/flink-1.20-SNAPSHOT
Found AWS bucket robeyoun-testing-flink-13-03-2024, running the e2e test.
Found AWS access key, running the e2e test.
Found AWS secret key, running the e2e test.
Unable to find image 'banstz/awscli:latest' locally
docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
running aws cli container failed
Unable to find image 'banstz/awscli:latest' locally
docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
running aws cli container failed
running the aws cli container failed
[FAIL] Test script contains errors.
Checking for errors...
No errors in log files.
Checking for exceptions...
No exceptions in log files.
Checking for non-empty .out files...
grep: /home/roby/development/redhat-managed-kafka/upstream/flink/build-target/log/*.out: No such file or directory
No non-empty .out files.

[FAIL] 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions' failed after 0 minutes and 6 seconds! Test exited with exit code 1

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

robobario · 2024-03-13T23:09:14Z

flink-end-to-end-tests/test-scripts/common_s3_operations.sh

@@ -29,12 +29,18 @@
 #   AWSCLI_CONTAINER_ID
 ###################################
 function aws_cli_start() {
-  export AWSCLI_CONTAINER_ID=$(docker run -d \


See https://www.shellcheck.net/wiki/SC2155 (declare and assign separately to avoid masking return values).
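
A small demonstration of the masking that SC2155 warns about, written as a sketch: failing_command is a hypothetical stand-in for a docker run that fails, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a failing `docker run`; not part of the Flink scripts.
failing_command() { return 1; }

# Declare and assign in one step: the status is that of `export`, which succeeds.
if export COMBINED_ID=$(failing_command); then
  combined_status=0
else
  combined_status=$?
fi

# Assign on its own: the status is that of the command substitution.
if SEPARATE_ID=$(failing_command); then
  separate_status=0
else
  separate_status=$?
fi
export SEPARATE_ID

echo "combined: ${combined_status}, separate: ${separate_status}"
# prints "combined: 0, separate: 1"
```

With the combined form the failure is invisible to the caller, which matches the symptom described in this PR: the variable ends up empty and nothing stops the script.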

flinkbot · 2024-03-13T23:14:30Z

CI report:

4b233a7 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

JingGe

Thanks for working on it! I just left one comment. PTAL

JingGe · 2024-03-14T17:57:10Z

flink-end-to-end-tests/test-scripts/common_s3_operations.sh

@@ -58,7 +64,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
  aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start


Will a simple retry like this let the script run multiple times without properly cleaning up the previous container? As far as I understand, aws_cli_start is not idempotent.

I'm assuming that docker run -d ... does not create a container when it exits with a non-zero code, so a failed aws_cli_start shouldn't leave anything to clean up. But I can't find docs stating that.

Happy to remove the retry; getting the test to fail faster with a more obvious cause is an improvement on its own.

I've added a failsafe that kills and removes the container if CONTAINER_ID is non-empty after docker run fails.
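
The single retry combined with the cleanup failsafe discussed here could look roughly like the sketch below. It is not the PR's code: fake_docker_run, fake_docker_rm, start_container, ATTEMPTS_FILE and REMOVED_IDS are hypothetical names, and the stub deliberately prints an ID while failing, to model the unconfirmed case where docker run exits non-zero but still leaves a container behind.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real script would invoke docker here.
ATTEMPTS_FILE=$(mktemp)
echo 0 > "$ATTEMPTS_FILE"
REMOVED_IDS=""

fake_docker_run() {
  # Fails on the first call only, yet still prints an ID, to mimic the
  # (unconfirmed) case of a non-zero exit that leaves a container behind.
  attempt=$(( $(cat "$ATTEMPTS_FILE") + 1 ))
  echo "$attempt" > "$ATTEMPTS_FILE"
  echo "container-$attempt"
  [ "$attempt" -gt 1 ]
}

fake_docker_rm() {
  REMOVED_IDS="${REMOVED_IDS}$1 "
}

start_container() {
  if ! CONTAINER_ID=$(fake_docker_run); then
    # Failsafe: remove anything that was created despite the failure.
    if [ -n "$CONTAINER_ID" ]; then
      fake_docker_rm "$CONTAINER_ID"
    fi
    CONTAINER_ID=""
    return 1
  fi
}

# One naive retry: the second call runs only if the first returned non-zero.
start_container || start_container
echo "running: ${CONTAINER_ID}, removed: ${REMOVED_IDS}"
```

Because the failed attempt clears CONTAINER_ID after removing the leftover, the second attempt starts from a clean state, which addresses the idempotency concern raised above.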

Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This led to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry. Signed-off-by: Robert Young <robeyoun@redhat.com>

Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once. Signed-off-by: Robert Young <robeyoun@redhat.com>

Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly. Signed-off-by: Robert Young <robeyoun@redhat.com>

Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded. Signed-off-by: Robert Young <robeyoun@redhat.com>

rmetzger

Thx for working on this.

JingGe

LGTM

…che#24491)

* [FLINK-34569][e2e] Fail fast if aws cli container fails to run

Why: An end-to-end test run failed and in the test logs you could see that the AWS cli container failed to start. Because of the way it's organised the failure in the subshell did not cause a failure and AWSCLI_CONTAINER_ID was empty. This led to a loop trying to docker exec a command in a container named "" and the test taking 15 minutes to time out. This change speeds up the failure. Note that we use 'return' to prevent an immediate failure of the script so that we have the potential to implement a simple retry.

Signed-off-by: Robert Young <robeyoun@redhat.com>

* [FLINK-34569][e2e] Add naive retry when creating aws cli container

Why: An end-to-end test run failed with what looked like a transient network exception when pulling the aws cli image. This retries once.

Signed-off-by: Robert Young <robeyoun@redhat.com>

* [FLINK-34569][e2e] Remove jq containers after user

Why: A large pile of exited jq containers were left in docker after an operation was retried repeatedly.

Signed-off-by: Robert Young <robeyoun@redhat.com>

* [FLINK-34569][e2e] Clean up after failed awscli container run

Why: If for some reason the command can return a non-zero exit code and also create a container, this will remove it so we don't have an orphan sitting stranded.

Signed-off-by: Robert Young <robeyoun@redhat.com>

---------

Signed-off-by: Robert Young <robeyoun@redhat.com>

robobario commented Mar 13, 2024

JingGe reviewed Mar 14, 2024

robobario added 4 commits March 18, 2024 09:13

[FLINK-34569][e2e] Add naive retry when creating aws cli container

92354f8

[FLINK-34569][e2e] Remove jq containers after user

ca05028

[FLINK-34569][e2e] Clean up after failed awscli container run

4b233a7

robobario force-pushed the FLINK-34569-end-to-end-test-timeout branch from 119cd86 to 4b233a7 Compare March 17, 2024 20:13

robobario requested a review from JingGe March 19, 2024 04:13

rmetzger approved these changes Jun 5, 2024

JingGe approved these changes Jun 5, 2024

rmetzger merged commit b8d5271 into apache:master Jun 5, 2024

This was referenced Jun 5, 2024

[Flink 34569][e2e] fail fast if AWS cli container fails to start (#24… #24893

Merged

[FLINK-34569][e2e] fail fast if AWS cli container fails to start (#24… #24894

Merged
