Health Check for Kafka Brokers.
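Health checker for Kafka brokers and clusters that operates by checking whether:
* a broker can replay the messages of its dedicated health check topic,
* a broker stays in the ISR of the replication check topic,
* a broker is in sync with all partition leaders,
* partitions are under-replicated or offline, and
* the Kafka and ZooKeeper metadata are consistent with each other.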
The release version is 0.1.0.
Compiled binaries are available for Linux, macOS, and FreeBSD.
Submit a pull request to have your use case listed here!
Self-healing cluster
At AutoScout24, in order to reduce operational workload, we use kafka-health-check to automatically restart broker nodes as they become unhealthy.
In-place rolling updates
At AutoScout24, we perform regular in-place rolling updates to keep the operating systems of our AWS-hosted clusters up to date. Because we run immutable servers, we terminate each broker and replace it with a fresh EC2 instance that keeps the previous broker id. To avoid jeopardizing cluster stability, we verify that the cluster is healthy before taking a broker offline. Similarly, we wait for the replacement broker to come back online and fully catch up before proceeding with the next broker. To achieve this, we use the cluster health information provided by kafka-health-check.
Usage of kafka-health-check:
  -broker-host string
        ip address or hostname of broker host (default "localhost")
  -broker-id uint
        id of the Kafka broker to health check
  -broker-port uint
        Kafka broker port (default 9092)
  -check-interval duration
        how frequently to perform health checks (default 10s)
  -no-topic-creation
        disable automatic topic creation and deletion
  -replication-failures-count uint
        number of replication failures before broker is reported unhealthy (default 5)
  -replication-topic string
        name of the topic to use for replication checks - use one per cluster, defaults to broker-replication-check
  -server-port uint
        port to open for http health status queries (default 8000)
  -topic string
        name of the topic to use - use one per broker, defaults to broker--health-check
  -zookeeper string
        ZooKeeper connect string (e.g. node1:2181,node2:2181,.../chroot)
Broker health can be queried at /:

$ curl -s :8000/
{ "broker": 1, "status": "sync" }
Return codes and status values are:
* 200 with sync for a healthy broker that is fully in sync with all leaders.
* 200 with imok for a healthy broker that replays messages of its health check topic but is not fully in sync.
* 500 with nook for an unhealthy broker that either fails to replay messages of its health check topic within 200 milliseconds or fails to stay in the ISR of the replication check topic for more checks than replication-failures-count (default 5).
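When scripting against the broker endpoint, the code/status pairs listed above can be encoded directly. This is a minimal Go sketch; the type brokerState and the function interpretBroker are hypothetical names, and treating any other code/status combination as an error is a design choice of this sketch, not documented behavior of the tool.

```go
package main

import "fmt"

// brokerState summarizes what a response from the broker endpoint (/) means.
type brokerState struct {
	Healthy bool // true for "sync" and "imok" (HTTP 200)
	InSync  bool // true only for "sync"
}

// interpretBroker maps the documented HTTP code and status value to a brokerState.
// Undocumented combinations are reported as an error rather than guessed.
func interpretBroker(code int, status string) (brokerState, error) {
	switch {
	case code == 200 && status == "sync":
		return brokerState{Healthy: true, InSync: true}, nil
	case code == 200 && status == "imok":
		return brokerState{Healthy: true, InSync: false}, nil
	case code == 500 && status == "nook":
		return brokerState{Healthy: false, InSync: false}, nil
	default:
		return brokerState{}, fmt.Errorf("unexpected response: %d %q", code, status)
	}
}

func main() {
	responses := []struct {
		code   int
		status string
	}{{200, "sync"}, {200, "imok"}, {500, "nook"}, {503, "?"}}
	for _, r := range responses {
		st, err := interpretBroker(r.code, r.status)
		fmt.Println(r.code, r.status, st, err)
	}
}
```

For example, a rolling update that waits for a returning broker to fully catch up would wait for InSync rather than merely Healthy, since imok already counts as healthy.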
The returned JSON contains details about the replicas the broker is lagging behind:

$ curl -s :8000/
{
  "broker": 3,
  "status": "imok",
  "out-of-sync": [ { "topic": "mytopic", "partition": 0 } ],
  "replication-failures": 1
}
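A client can decode this body with Go's standard encoding/json package. In this minimal sketch the type names and parseBrokerHealth are hypothetical; the JSON field names are taken from the example response. As in the first example at /, out-of-sync and replication-failures may be absent, in which case they decode to their zero values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// partitionRef identifies a partition the broker is lagging behind on.
type partitionRef struct {
	Topic     string `json:"topic"`
	Partition int    `json:"partition"`
}

// brokerHealth mirrors the JSON returned by the broker endpoint (/).
type brokerHealth struct {
	Broker              int            `json:"broker"`
	Status              string         `json:"status"`
	OutOfSync           []partitionRef `json:"out-of-sync"`
	ReplicationFailures int            `json:"replication-failures"`
}

// parseBrokerHealth decodes a response body from the broker endpoint.
func parseBrokerHealth(body []byte) (brokerHealth, error) {
	var h brokerHealth
	err := json.Unmarshal(body, &h)
	return h, err
}

func main() {
	// Example body from the response above.
	body := []byte(`{ "broker": 3, "status": "imok",
		"out-of-sync": [ { "topic": "mytopic", "partition": 0 } ],
		"replication-failures": 1 }`)
	h, err := parseBrokerHealth(body)
	if err != nil {
		panic(err)
	}
	for _, p := range h.OutOfSync {
		fmt.Printf("broker %d lags behind %s/%d\n", h.Broker, p.Topic, p.Partition)
	}
	fmt.Println("replication failures:", h.ReplicationFailures)
}
```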
Cluster health can be queried at /cluster:

$ curl -s :8000/cluster
{ "status": "green" }
Return codes and status values are:
* 200 with green if all replicas of all partitions of all topics are in sync and metadata is consistent.
* 200 with yellow if one or more partitions are under-replicated and metadata is consistent.
* 500 with red if one or more partitions are offline or metadata is inconsistent.
The returned JSON contains details about metadata status and partition replication:

$ curl -s :8000/cluster
{
  "status": "yellow",
  "topics": [
    {
      "topic": "mytopic",
      "status": "yellow",
      "partitions": {
        "1": { "status": "yellow", "OSR": [ 3 ] },
        "2": { "status": "yellow", "OSR": [ 3 ] }
      }
    }
  ]
}
The fields with additional information and their structures are:
* topics for topic replication status:
  [{"topic":"mytopic","status":"yellow","partitions":{"2":{"status":"yellow","OSR":[3]}}}]
  In this data, OSR means out-of-sync replica and contains the list of all brokers that are not in the ISR of that partition.
* metadata for inconsistencies between ZooKeeper and Kafka metadata:
  [{"broker":3,"status":"red","problem":"Missing in ZooKeeper"}]
* zookeeper for problems with the ZooKeeper connection or data; contains a single string:
  "Fetching brokers failed: ..."
Tested with the following Kafka versions:
Kafka 0.8 is not supported.
See the compatibility spec for the full list of executed compatibility checks. To execute the compatibility checks, run make compatibility. Running the checks requires Docker.
To build, first run make deps to restore the dependencies using govendor, then run make.