Description
Ask a question
Hi
I've been meaning to provide feedback on the upgrade functionality for quite some time, but life has gotten in the way. Maybe this should have been multiple issues, and I might have missed some details or points, but it is what it is.
The latest Elastic docs and the implemented process differ in a few ways:
- Stopping ML nodes is not implemented.
- The docs say "cluster.routing.allocation.enable": "primaries", whereas the implementation uses none
- The docs recommend upgrading tier by tier (frozen, cold, warm, hot)
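To illustrate the tier-by-tier point, here is a minimal sketch of how nodes could be ordered before upgrading. The role names are the standard Elasticsearch data tier roles, but `order_for_upgrade`, the input shape (node name mapped to a list of roles), and the rule that a node with several tier roles is ranked by its coldest tier are my own assumptions, not something the role currently does.

```python
# Coldest tier first, as the upgrade docs recommend.
TIER_ORDER = ["data_frozen", "data_cold", "data_warm", "data_hot"]


def upgrade_rank(roles):
    """Return the position of the coldest tier role a node holds.

    Nodes without any tier role (for example dedicated masters) sort last.
    Ranking multi-tier nodes by their coldest tier is an assumption.
    """
    for rank, role in enumerate(TIER_ORDER):
        if role in roles:
            return rank
    return len(TIER_ORDER)


def order_for_upgrade(nodes):
    """Sort node names by tier; `nodes` maps node name -> list of roles."""
    return sorted(nodes, key=lambda name: upgrade_rank(nodes[name]))


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    example = {
        "hot-1": ["data_hot", "data_content"],
        "frozen-1": ["data_frozen"],
        "master-1": ["master"],
        "warm-1": ["data_warm"],
        "cold-1": ["data_cold"],
    }
    print(order_for_upgrade(example))
```

In a playbook, the same ordering could presumably be achieved by grouping hosts per tier and running the upgrade play against each group in turn.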
Things I have observed during testing:
- The wait period for a cluster to return to Green status is not always long enough
- Sometimes the cluster never returns to Green status because there are no eligible nodes for the replica shards
- If a node fails, the entire play should abort. Currently it just drops the node that failed and keeps running for the rest of the nodes.
Questions:
Is the "cluster.routing.allocation.enable" setting based on earlier recommendations, or is there another reason to choose none over primaries?
My biggest blocker currently is that the cluster remains in a yellow state when there are replicas with no eligible nodes. The docs say to proceed with the upgrade in these cases. This means we would have to check the init and relo columns in _cat/health?v=true and continue once both are zero. This might be either trivial or far from trivial; I'm not sure, to be honest.
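A rough sketch of what the init and relo check could look like, assuming the health output is fetched with format=json (which returns a list with one object whose values are strings). The function name `safe_to_proceed` is hypothetical, and treating any non-red status with zero initializing and relocating shards as "good enough" is my reading of the docs, not the current behaviour of the role.

```python
import json


def safe_to_proceed(cat_health_json):
    """Decide whether the next node can be upgraded.

    `cat_health_json` is the raw body of GET _cat/health?format=json.
    Returns True when the cluster is green or yellow and no shards are
    initializing or relocating.
    """
    rows = json.loads(cat_health_json)
    health = rows[0]
    if health["status"] == "red":
        return False
    # The cat API returns counts as strings, so convert before comparing.
    return int(health["init"]) == 0 and int(health["relo"]) == 0


if __name__ == "__main__":
    # Hand-written sample response: yellow, nothing moving, replicas unassigned.
    sample = '[{"status": "yellow", "relo": "0", "init": "0", "unassign": "4"}]'
    print(safe_to_proceed(sample))
```

In Ansible terms this would probably become an until condition on a uri task, retried until both counters reach zero.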
Regarding failing the entire play vs. a single node, this might be something in my Ansible setup or in my playbook. I've not had time to give this a hard look yet.
Adding a task to start/stop ML nodes should be trivial. I might drop a PR for this if/when I find the time.