Description
The example/default configuration file lists 5 servers:
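Something along these lines; the ConfigMap and host names below are illustrative, not the actual manifest. The point is that only the first 3 entries correspond to pods the 3-replica statefulset actually runs:

```yaml
# Sketch only -- object and host names are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zookeeper-config
data:
  zoo.cfg: |
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper/data
    clientPort=2181
    server.1=zoo-0.zoo:2888:3888
    server.2=zoo-1.zoo:2888:3888
    server.3=zoo-2.zoo:2888:3888
    # the last two entries point at pods that never exist with replicas: 3
    server.4=zoo-3.zoo:2888:3888
    server.5=zoo-4.zoo:2888:3888
```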
This looks like a mistake. It happens to work because with 5 servers defined, ZooKeeper's quorum size is 3, and the 3 nodes in the statefulset are just enough to form that quorum. However, it is extremely fragile: losing any single node drops the ensemble below quorum and takes the ZK cluster (and hence the Kafka deployment) hard down, e.g. bootstrap times out:
$ kafkacat -b k8s.internal.example.com:32401 -L
% ERROR: Failed to acquire metadata: Local: Broker transport failure
Observed log lines from zookeeper:
Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running (org.apache.zookeeper.server.NIOServerCnxn)
I will be testing this theory out soon by removing those two extra server entries and seeing whether ZK stays happy through a single statefulset node failure.
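That would leave a 3-server ensemble (quorum of 2), which should tolerate one node outage. Sketched with the same illustrative names as above:

```yaml
  zoo.cfg: |
    # only the 3 servers the statefulset actually runs; quorum becomes 2 of 3
    server.1=zoo-0.zoo:2888:3888
    server.2=zoo-1.zoo:2888:3888
    server.3=zoo-2.zoo:2888:3888
```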
While I'm here: the statefulsets should be defined with podManagementPolicy: Parallel, so that if e.g. broker 0 goes down, the controller doesn't hold recovery of brokers 1+ hostage to the default OrderedReady sequencing, and the system can recover from multi-node failures faster.
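A minimal sketch of what I mean (names, labels, and image are illustrative, not the actual manifests in this repo):

```yaml
# Sketch only -- metadata, labels, and image are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: broker
  replicas: 3
  # Parallel lets the controller (re)create all missing pods at once instead of
  # waiting for lower-ordinal pods to become Ready first (the OrderedReady default).
  # Rolling updates are still governed by updateStrategy, not by this field.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
      - name: broker
        image: example/kafka:latest   # illustrative
        ports:
        - containerPort: 9092
```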