[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179
Conversation
Seriously, my only theory is that something takes too much time for a container to come up, making the health check fail. I am currently looking at exposing the docker-compose logs, because debugging this on Travis is not really fun. What is frustrating is that the x-pack docker-compose file is really close to the other file that doesn't fail. I am thinking that the password setup that we do at the beginning of the ES container is making the thing go boom; of course, it doesn't happen every time.
Do not merge, this PR is for testing.
From my experience looking at these logs, they always seem to fail on the health check for elasticsearch.
Would we get some more info by running docker-compose with …?
After talking with security, the main problem here is that our health check is actually wrong (and might be wrong in our other docker-compose files). Security is just an index: when ES is started, users are added to the security index. If ES responds to calls, it doesn't mean that the cluster is green, and it doesn't mean that the security index is accessible; until the cluster is green, it is possible that looking up a user effectively fails.
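A minimal sketch of the distinction being described, using curl and grep rather than the Python one-liner this PR actually ships (assumes the `elastic:changeme` credentials used throughout these compose files):

```yaml
healthcheck:
  # Old style: passes as soon as ES answers HTTP at all -- too early when
  # security is enabled, because the security index may not be recovered yet.
  # test: ["CMD", "curl", "-u", "elastic:changeme", "-f", "http://localhost:9200"]
  # Cluster-health style: passes only once the cluster reports green, i.e.
  # all indices, including the security index, are fully available.
  test: ["CMD-SHELL", "curl -s -u elastic:changeme http://localhost:9200/_cluster/health | grep -q '\"status\":\"green\"'"]
```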
Force-pushed from 9491a9b to 10f35c4
This reverts commit edeed09, and does the following:
- Move all docker-compose.yml files to the version 2.3 format to have support for `start_period`.
- The health check checks the cluster health instead of checking that the host responds.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- Dump the last health check information and the docker-compose logs.
Force-pushed from 10f35c4 to 13345b6
```diff
     environment:
       - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
       - "network.host="
       - "transport.host=127.0.0.1"
       - "http.host=0.0.0.0"
       - "xpack.security.enabled=true"
       - "xpack.license.self_generated.type=trial"
-    command: bash -c "bin/elasticsearch-keystore create && echo changeme | bin/elasticsearch-keystore add --stdin bootstrap.password && /usr/local/bin/docker-entrypoint.sh eswrapper"
-    restart: on-failure:5
+      - ELASTIC_PASSWORD=changeme
```
Thanks for the investigation. Probably the health check should also be updated in the testing/environment setup?
```diff
@@ -51,6 +48,6 @@ services:
   healthcheck:
     test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:5601/api/status"); data = json.loads(response.read()); exit(1) if data["status"]["overall"]["state"] != "green" else exit(0);''']
     retries: 1200
-    interval: 1s
+    interval: 5s
```
Any reason for this quite long delay and the longer interval? This will slow down builds, I would assume?
Well, slowdown is actually relative for the healthcheck: if you wait 5s instead of 1s, that's 4 fewer requests ES has to deal with before the next check. Since Travis is not the fastest thing in the world, giving it more time to start is a good compromise. You will see that I've only modified these limits for x-pack/libbeat/docker-compose.yml and not the other files; that's because it takes more time to start and recover the security index to allow authentication.

@ruflin if the test suite becomes flaky or unreliable, we will have to switch to file-based authentication for ES.
I would hope that ES can cope with 1 request per second ;-) For the `start_period`, I tried to find some good docs, but I assume it initially waits 60s? That seems a bit too much.
Overall my goal is to keep the CI time down; as soon as a service is available, tests should start.
I think so too, but well, it's Travis, it's Docker; I prefer to give it as much chance as possible. Maybe call it that I am fed up with this and would prefer not to play whack-a-mole anymore :D

`start_period` is a really bad name; it should be called a grace period. When the container starts, a health check that fails during that period will not increase the fail count, but if a health check succeeds during that period the container can be marked as healthy, and when this happens the remaining time of the `start_period` is ignored. So it doesn't really impact startup that much; it just gives the container a bit more time to become healthy.
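A minimal sketch of what this looks like in a compose file, with illustrative values (not necessarily the ones this PR uses; the image tag is hypothetical):

```yaml
version: '2.3'  # start_period requires the 2.3 compose file format or newer
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.5.4  # hypothetical tag
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200"]
      interval: 5s       # probe every 5 seconds
      retries: 300       # failed probes count only once start_period has elapsed
      start_period: 60s  # failures during the first 60s don't count against retries,
                         # but a success during this window marks the container healthy
```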
SGTM, good argument ;-)
```diff
 elasticsearch:
   extends:
     file: ${ES_BEATS}/testing/environments/${TESTING_ENVIRONMENT}.yml
     service: elasticsearch
   healthcheck:
-    test: ["CMD", "curl", "-u", "elastic:changeme", "-f", "http://localhost:9200"]
+    test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:9200/_cluster/health"); data = json.loads(response.read()); exit(1) if data["status"] != "green" else exit(0);''']
```
++
Currently this should only matter for x-pack or anything that uses security; I didn't want to change anything in the other docker-compose files.
@ruflin Just to clarify, I will not change the health check for the testing/environment files, because as of today nothing other than security assumes that a cluster needs to be green; but as we add more tests, or when/if Metricbeat gets integration tests that require security, we could move all the docker-compose files to that strategy.
```diff
     retries: 1200
-    interval: 1s
+    interval: 5s
```
Can we use the old one again to speed up builds?
```diff
@@ -202,7 +202,7 @@ integration-tests: prepare-tests
 .PHONY: integration-tests-environment
 integration-tests-environment: ## @testing Runs the integration inside a virtual environment. This can be run on any docker-machine (local, remote)
 integration-tests-environment: prepare-tests build-image
-	${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME}
+	${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME} || docker inspect --format "{{json .State.Health }}" $$(${DOCKER_COMPOSE} ps -q) | jq && ${DOCKER_COMPOSE} logs --tail 200
```
I assume this was for debugging purposes. Should we keep it in?
@ruflin YES we should, because when docker-compose fails it can be hard to either reproduce the failure or get the logs. It took me a few tries to figure out how to get useful debug information, and when I finally found a good way it took 50 runs (7m each) to finally make it fail.
The above will do two things:
- Display the last information for the healthcheck.
- Display the last 200 log lines for each container.

Number 2 is why I was able to actually make sense of this error: looking at the logs made it clear that ES was not in a green state and that authentication was failing.
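For reference, the same dump can be reproduced by hand against a running compose project (assuming `jq` is installed, as the Makefile target above does):

```sh
# Last health-check results for every container in the project...
docker inspect --format '{{json .State.Health}}' $(docker-compose ps -q) | jq
# ...and the last 200 log lines per container.
docker-compose logs --tail 200
```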
…10259) * Enable back CM integration suite. This reverts commit edeed09, and does the following: - Move all docker-compose.yml files to the version 2.3 format to have support for `start_period` - The health check checks the cluster health instead of checking that the host responds - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI - Dump the last health check information and the docker-compose logs
…10260) * Enable back CM integration suite. This reverts commit edeed09, and does the following: - Move all docker-compose.yml files to the version 2.3 format to have support for `start_period` - The health check checks the cluster health instead of checking that the host responds - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI - Dump the last health check information and the docker-compose logs
…astic#10179) (elastic#10260) * Enable back CM integration suite. This reverts commit 0724101, and does the following: - Move all docker-compose.yml files to the version 2.3 format to have support for `start_period` - The health check checks the cluster health instead of checking that the host responds - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI - Dump the last health check information and the docker-compose logs
This reverts commit edeed09.
Looking at this with fresh eyes, I am not sure it's flakiness; we should get logs from the docker containers on failures.