Make Cruise Control resilient against bad metadata due to recreation of same topic #1726
Labels: correctness (a condition affecting the proper functionality), robustness (makes the project tolerate or handle perturbations)
Summary: In clusters where a Kafka topic is deleted and then recreated with the same name, it is possible for Cruise Control (CC) to be stuck with a stale version of Kafka metadata. Due to this staleness, CC might be unable to see a subset of such newly created topics in the cluster. As a result, (1) the `kafka_cluster_state` endpoint may return a response that misses a subset of partitions in the cluster, and (2) any CC operation that requires generating proposals (e.g. `rebalance`, `remove_broker`) might miss some partitions -- e.g. `remove_broker` might fail to drain all replicas from the removed broker because CC's metadata has no information about them.

Short-term mitigation: Bouncing the CC instance forces it to refresh its cached metadata. If users encounter a case where the metadata is stale, that is the fastest short-term mitigation.
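As a quick way to confirm the symptom, one could query the verbose cluster state and check whether the affected topic shows up at all. Below is a minimal sketch, assuming a local CC instance on port 9090, the `verbose`/`json` query parameters, and simple substring matching on the response body; host, port, and parameters are deployment-specific assumptions, not part of this report:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Rough sketch: ask Cruise Control for its verbose cluster state and check whether a
// recently recreated topic is visible at all. Host, port, and substring matching are
// assumptions for illustration only.
public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        String topic = args.length > 0 ? args[0] : "DeletedAndThenRecreatedTopic";
        URI uri = URI.create(
            "http://localhost:9090/kafkacruisecontrol/kafka_cluster_state?verbose=true&json=true");

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
            client.send(HttpRequest.newBuilder(uri).GET().build(),
                        HttpResponse.BodyHandlers.ofString());

        // If the topic exists in Kafka but is missing from the verbose response,
        // CC's cached metadata is likely stale (the situation described in this issue).
        boolean seen = response.body().contains(topic);
        System.out.println(topic + (seen
            ? " is visible to Cruise Control"
            : " is MISSING from Cruise Control's cluster state"));
    }
}
```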
Details: We enabled trace-level logs on CC to inspect the content of received metadata in a case where we suspected metadata staleness. In this case, a topic that existed in the cluster was deleted and then recreated with the same name. CC was unable to show the partitions of that topic in the verbose response of `kafka_cluster_state`.

The content of the metadata showed that the topic information was indeed available in the response received from the broker – i.e.

However, CC logs show that the underlying metadata cache ignores this partition because its leader epoch is less than the locally cached leader epoch:
The motivation behind this leader-epoch check is to avoid updating the local metadata cache with stale metadata for partitions. However, in this case, because the previously existing topic partition had reached a larger epoch, the client cache fails to be updated unless the epoch of the new topic partition DeletedAndThenRecreatedTopic-0 eventually grows above the locally cached epoch – i.e. Kafka resets the leader epoch of a partition after its deletion, but since the locally cached leader epoch in CC's metadata client is unaware of that deletion, it continues to ignore updates for the new partition.
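To make the failure mode concrete, here is a minimal sketch of an epoch-gated cache update rule of the kind described above. This is not the actual Kafka client code; the class and method names (`LeaderEpochGatedCache`, `maybeUpdate`) are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of an epoch-gated metadata cache. The real logic lives in the
// Kafka client's metadata handling; names here are hypothetical.
class LeaderEpochGatedCache {
    // topic-partition -> last leader epoch seen for that partition
    private final Map<String, Integer> lastSeenLeaderEpoch = new HashMap<>();

    /**
     * Apply a metadata update for a partition. Updates carrying a lower leader epoch than
     * the cached one are ignored, which normally protects against stale metadata.
     */
    boolean maybeUpdate(String topicPartition, int newLeaderEpoch) {
        Integer cachedEpoch = lastSeenLeaderEpoch.get(topicPartition);
        if (cachedEpoch != null && newLeaderEpoch < cachedEpoch) {
            // This is the branch the issue describes: after delete + recreate, the new
            // partition starts again from a small epoch, so its updates are rejected
            // until the epoch grows past the stale cached value.
            return false;
        }
        lastSeenLeaderEpoch.put(topicPartition, newLeaderEpoch);
        return true;
    }
}
```

Under such a rule, if the cached epoch for DeletedAndThenRecreatedTopic-0 is, say, 5 from the pre-deletion topic, every update for the recreated partition (whose epoch restarts near 0) is dropped until its epoch climbs past 5.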
Note that this issue can only happen for partitions of topics that have been deleted and recreated with the same name in succession.
Relevant: #1708
--
Note that the same issue exists in regular Kafka consumers (we reproduced the issue reported in https://issues.apache.org/jira/browse/KAFKA-12257).
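For reference, a rough reproduction sketch against plain Kafka clients is below. Broker address, topic name, partition counts, and waits are assumptions, not the exact steps we used; the stale-epoch behaviour is observed by enabling TRACE logging for org.apache.kafka.clients.Metadata, as in the JIRA above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

// Rough reproduction sketch (assumed broker address and topic name). Run with TRACE
// logging enabled for org.apache.kafka.clients.Metadata to see the client ignoring
// metadata for the recreated partition because of the stale cached leader epoch.
public class RecreatedTopicRepro {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092";
        String topic = "DeletedAndThenRecreatedTopic";

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "recreated-topic-repro");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (Admin admin = Admin.create(adminProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {

            // 1. Create the topic and let the consumer cache its metadata (leader epochs included).
            admin.createTopics(Collections.singleton(new NewTopic(topic, 1, (short) 1))).all().get();
            consumer.assign(Collections.singleton(new TopicPartition(topic, 0)));
            consumer.poll(Duration.ofSeconds(5));

            // 2. Bump the leader epoch a few times out of band (e.g. leader elections or broker
            //    restarts) so the cached epoch ends up above the epoch the recreated partition
            //    will start from. This step is manual in practice.

            // 3. Delete and recreate the topic with the same name.
            admin.deleteTopics(Collections.singleton(topic)).all().get();
            Thread.sleep(5_000); // crude wait for the deletion to complete
            admin.createTopics(Collections.singleton(new NewTopic(topic, 1, (short) 1))).all().get();

            // 4. Keep polling: the consumer's cached leader epoch for partition 0 is unaware of
            //    the deletion, so metadata updates for the recreated partition can be ignored
            //    until its epoch grows past the stale cached value (watch the TRACE logs).
            for (int i = 0; i < 12; i++) {
                consumer.poll(Duration.ofSeconds(5));
            }
        }
    }
}
```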