Is your Kafka cluster really highly available?
As developers, we take the high availability guarantee of Kafka for granted. But there are a few nuances that need to be taken into consideration before we can be sure of this guarantee.
In this article, I will describe one such nuance and the process to detect and fix any such misconfiguration of your cluster.
It is common for Kafka production deployments to have a multi-broker cluster on which you host many topics. __consumer_offsets is one of Kafka's core internal topics. It is critical to the proper functioning of consumers and therefore to the overall availability of the cluster.
What is so special about the __consumer_offsets topic?
When a consumer starts reading from a topic’s partition, it needs to understand from which point in the partition the consumption should begin. This is where the __consumer_offsets topic comes into the picture. The core functionality of this topic is consumer offset tracking. You can read more about this here.
That sounds interesting but how does this topic impact my cluster availability?
Even though the documented default for this topic is 50 partitions with a replication factor of 3, if you download the latest build of Kafka, this is what the bundled server.properties file contains:
offsets.topic.replication.factor=1
For a topic as critical as __consumer_offsets, this is an extraordinarily bad idea.
Let us assume that you went ahead with this config, i.e. a replication factor of 1 for the __consumer_offsets topic, on a Kafka cluster with 3 brokers. The topic's partitions will be spread evenly across the 3 brokers.
If one of the brokers goes down (e.g. the EC2 instance hosting that broker has to restart), the cluster will become unavailable to the consumers.
Wait… why?
Here is a brief example.
Remember, a replication factor of 1 implies NO ADDITIONAL PARTITION REPLICAS: each partition lives on exactly one broker.
When we have 3 brokers, the 50 partitions of the __consumer_offsets topic will be mapped onto the brokers like this.
B1 -> p1, p4, p7, p10, …, p49
B2 -> p2, p5, p8, p11, …, p50
B3 -> p3, p6, p9, p12, …, p48
Now, what will happen if broker B3 becomes unavailable? Since partitions p3, p6, p9, p12, …, p48 each have only one replica, which was hosted on broker B3, these partitions become unavailable. As a result, the topic (i.e. __consumer_offsets) is no longer fully available.
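The failure scenario above can be reproduced with a few lines of Python. This is only a sketch: the helper names are made up for illustration, the partitions are numbered 1 to 50 to match the example, and the simple round-robin placement is an assumption (a real cluster numbers partitions from 0 and chooses its own starting broker).

```python
from collections import defaultdict


def assign_round_robin(num_partitions, brokers):
    """Place each partition (numbered 1..num_partitions) on exactly one broker."""
    assignment = defaultdict(list)
    for partition in range(1, num_partitions + 1):
        broker = brokers[(partition - 1) % len(brokers)]
        assignment[broker].append(partition)
    return dict(assignment)


def offline_partitions(assignment, failed_broker):
    """With replication factor 1, every partition on the failed broker is lost."""
    return assignment.get(failed_broker, [])


assignment = assign_round_robin(50, ["B1", "B2", "B3"])
lost = offline_partitions(assignment, "B3")
print(f"B3 hosted {len(lost)} of 50 partitions, e.g. {lost[:4]}")
```

With a replication factor of 3 on the same three brokers, every partition would have a copy on B1 and B2 as well, and the same failure would take nothing offline.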
So what?
Every consumer group on the cluster commits its offsets to one partition of the __consumer_offsets topic, and the broker leading that partition coordinates the group. When that partition is offline, the group's consumers cannot commit offsets or rejoin the group, so they cannot consume from ANY topic hosted on this cluster, no matter what the replication factor of the topic being consumed is. With a third of the offsets partitions gone, the availability of the cluster goes for a toss.
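To judge which consumer groups a lost partition affects, it helps to know how a group is tied to a partition of __consumer_offsets. To my understanding, the broker takes the Java hash code of the group id, makes it non-negative, and takes it modulo the partition count. The sketch below mimics that in Python. The function names are hypothetical, and the partition numbers it returns are 0-based, as on a real cluster.

```python
def java_string_hash(text):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = text.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id, num_partitions=50):
    """Partition of __consumer_offsets that stores this group's offsets (assumed rule)."""
    h = java_string_hash(group_id)
    non_negative = 0 if h == -2**31 else abs(h)  # Integer.MIN_VALUE has no positive twin
    return non_negative % num_partitions


for group in ["ab", "billing-service", "analytics"]:  # hypothetical group ids
    print(group, "->", offsets_partition_for(group))
```

If the partition printed for your group is hosted only on the broker that went down, that group is stuck until the broker comes back.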
You can read about the __consumer_offsets topic in detail here.
How do I check if this bad config is present in my Kafka cluster?
You can use the kafka-topics utility to describe the __consumer_offsets topic.
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --topic __consumer_offsets --describe
If the output of the above command mentions ReplicationFactor as 1, then hop on to the next step to fix it.
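If you want to automate this check, for example in a health-check script, the describe output can be scanned for the ReplicationFactor field. The sketch below is a minimal example: the function name is hypothetical, and the sample line is illustrative rather than captured from a real cluster, so the exact fields may differ between Kafka versions.

```python
import re


def replication_factor(describe_output):
    """Return the ReplicationFactor from the topic summary line, or None if absent."""
    match = re.search(r"ReplicationFactor:\s*(\d+)", describe_output)
    return int(match.group(1)) if match else None


# Illustrative sample of a summary line, not real cluster output
sample = "Topic: __consumer_offsets\tPartitionCount: 50\tReplicationFactor: 1\tConfigs: cleanup.policy=compact"

rf = replication_factor(sample)
if rf is not None and rf < 3:
    print(f"__consumer_offsets has replication factor {rf}: a single broker failure can take partitions offline")
```

In practice you would feed the function the captured output of the kafka-topics command instead of the sample string.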
How do we fix this?
You can use the out-of-the-box kafka-reassign-partitions utility to increase the replication factor. The steps to execute this operation are mentioned in the official Kafka documentation here.
You will need to create your custom reassignment-json-file and execute it with the above-mentioned utility.
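Writing 50 partition entries by hand is tedious, so the reassignment file can be generated. The sketch below builds the JSON structure that the documentation describes (a version field plus one entry per partition listing its replicas). The function name is hypothetical, and the broker ids 1, 2 and 3 are assumptions, so substitute the ids of your own brokers.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan that gives every partition `replication_factor` replicas."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for partition in range(num_partitions):
        # Rotate the starting broker so the first replica is spread across brokers
        replicas = [
            broker_ids[(partition + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("__consumer_offsets", 50, [1, 2, 3], 3)
reassignment_json = json.dumps(plan, indent=2)
print(reassignment_json[:200])
```

Save the JSON to a file, say increase-replication-factor.json, and pass it to the utility with --reassignment-json-file together with --execute, then run it again with --verify to confirm that the reassignment has completed. Note that the rotation above ignores where each partition currently lives, so compare the plan with the describe output if you want the existing replica to stay first in the list.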