Check node or cluster health in post-start script #87
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story. The labels on this GitHub issue will be updated when the story is started.
Hi Matthias, sorry to hear you had cluster downtime. Your idea sounds like a great improvement indeed. Would you mind submitting the PR? We'll review it next week. Thank you.
@mkuratczyk thanks for getting back to me! I created the PR in #89
The RabbitMQ app might not be able to start for genuine reasons, the most common one being unavailable peers to complete the Mnesia sync, which is a boot step prior to starting the rabbit app. In this common scenario, BOSH should continue with the next instance and not stop. A failing `post-start` script would stop the deployment in exactly this scenario.

If there was a genuine issue making the RabbitMQ app crash and not start, then this should bring the entire Erlang VM down, meaning a changing PID that Monit and ultimately BOSH would notice and correctly mark the canary node as failing. A crashing RabbitMQ app would not bring the Erlang VM down until v3.6.13, so I'm wondering if you were experiencing this known & already fixed issue: rabbitmq/rabbitmq-common#237

To guard against the RabbitMQ app crashing after the Mnesia sync completes, it would make sense to run a node & cluster check in a `post-start` script.
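For illustration, such a node & cluster check might look like the sketch below. This is not the release's actual script, just a minimal example of the idea; it assumes `rabbitmqctl` is on the instance's PATH (note that `node_health_check` is deprecated in later RabbitMQ versions in favour of `rabbitmq-diagnostics` checks):

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the release's actual post-start.
# Assumes rabbitmqctl is on the PATH of the BOSH instance.
set -e

# Fails (non-zero exit) if the rabbit application is not running on
# this node, e.g. because it crashed after the Mnesia sync completed.
rabbitmqctl node_health_check

# Log the cluster view for the deploy output; fails if the local node
# is unreachable.
rabbitmqctl cluster_status
```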
@nodo Regarding your idea in #89 to investigate the logs:
This occurs multiple times for different queues. I'm afraid I'm not allowed to share the whole logs (since they might contain customer information). Post-deploy logs:

node1:

node2:

node3:
Hi all
While recently changing the configuration of our rabbitmq deployment, we ran into the following issue:
Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.
If we added a `post-start` script, this scenario could not happen. Since `post-start` scripts are run on each VM before BOSH marks it as healthy, individual node failures would be caught before moving on to the next instance. UAA already does this in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start
What do you think? We're happy to provide the `post-start` script as a PR.
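As a rough illustration of the proposal, a post-start in the spirit of the linked UAA script might poll the local node until it reports healthy and fail the deploy otherwise. This is only a sketch under assumptions: the retry count and delay are invented values, and it assumes `rabbitmqctl` is available on the instance:

```bash
#!/usr/bin/env bash
# Hypothetical post-start in the spirit of UAA's: poll the local node
# until it reports healthy, and fail the deploy if it never does.
# RETRIES and DELAY are invented values, not taken from the issue.
RETRIES=30
DELAY=5

for attempt in $(seq 1 "$RETRIES"); do
  if rabbitmqctl node_health_check > /dev/null 2>&1; then
    echo "RabbitMQ node healthy after ${attempt} attempt(s)"
    exit 0
  fi
  echo "Node not healthy yet (${attempt}/${RETRIES}); retrying in ${DELAY}s"
  sleep "$DELAY"
done

echo "RabbitMQ node failed to become healthy after ${RETRIES} attempts" >&2
exit 1
```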