Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies have bigger and bigger Kafka clusters. At LinkedIn, we have 2.6K+ Kafka brokers, which means broker deaths are an almost daily occurrence and balancing the workload of Kafka also becomes a big overhead.
Kafka Cruise Control is designed to address this operation scalability issue.
Kafka Cruise Control provides the following features out of the box:
Resource utilization tracking for brokers, topics, and partitions.
Query the current Kafka cluster state to see the online and offline partitions, in-sync and out-of-sync replicas,
min.insync.replicas, online and offline logDirs, and distribution of replicas in the cluster.
Multi-goal rebalance proposal generation for:
Anomaly detection, alerting, and self-healing for the Kafka cluster, including:
Admin operations, including:
masterbranch of Cruise Control is compatible with Apache Kafka
2.3(i.e. Releases with
kafka_0_11_and_1_0branch of Cruise Control is compatible with Apache Kafka
1.1(i.e. Releases with
migrate_to_kafka_2_4(development) branch of Cruise Control is compatible with Apache Kafka
2.4(i.e. Releases with
migrate_to_kafka_2_5(development) branch of Cruise Control is compatible with Apache Kafka
2.5(i.e. Releases with
0.10.0and above is needed
kafka_0_11_and_1_0branch compile with
git clone https://github.com/linkedin/cruise-control.git && cd cruise-control/
https://github.com/linkedin/cruise-control/releasesto pick a release -- e.g.
wget https://github.com/linkedin/cruise-control/archive/0.1.10.tar.gz && tar zxvf 0.1.10.tar.gz && cd cruise-control-0.1.10/
git init && git add . && git commit -m "Init local repo." && git tag -a 0.1.10 -m "Init local version."
CruiseControlMetricsReporteris used for metrics collection (i.e. the default for Cruise Control). The metrics reporter periodically samples the Kafka raw metrics on the broker and sends them to a Kafka topic.
A.B.Cis the version of the Cruise Control) to your Kafka server dependency jar folder. For Apache Kafka, the folder would be
com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter. For Apache Kafka, server properties are located at
SSLis enabled, ensure that the relevant client configurations are properly set for all brokers in
./config/server.properties. Note that
CruiseControlMetricsReportertakes all configurations for vanilla
KafkaProducerwith a prefix of
compact, make sure that the topic to which Cruise Control metrics reporter should send messages is created with the
deletecleanup policy -- the default metrics reporter topic is
config/cruisecontrol.propertiesof Cruise Control:
zookeeper.connectto the Kafka cluster to be monitored.
metric.sampler.classto your implementation (the default sampler class is
sample.store.classto your implementation if you have one (the default
./gradlew jar copyDependantLibs ./kafka-cruise-control-start.sh [-jars PATH_TO_YOUR_JAR_1,PATH_TO_YOUR_JAR_2] config/cruisecontrol.properties [port]
JAR files correspond to your applications and
port enables customizing the Cruise Control port number (default:
export JMX_PORT=56666before running
http://localhost:\[port\]/kafkacruisecontrol/stateif you specified the port when starting Cruise Control).
Cruise Control provides a REST API for users to interact with. See the wiki page for more details.
Cruise Control relies on the recent load information of replicas to optimize the cluster.
Cruise Control periodically collects resource utilization samples at both broker- and partition-level to infer the traffic pattern of each partition. Based on the traffic characteristics and distribution of all the partitions, it derives the load impact of each partition over the brokers. Cruise Control then builds a workload model to simulate the workload of the Kafka cluster. The goal optimizer explores different ways to generate cluster workload optimization proposals based on the user-specified list of goals.
Cruise Control also monitors the liveness of all the brokers in the cluster. To avoid the loss of redundancy, Cruise Control automatically moves replicas from failed brokers to alive ones.
For more details about how Cruise Control achieves that, see these slides.
To read more about the configurations. Check the configurations wiki page.
More about pluggable components can be found in the pluggable components wiki page.
The metric sampler enables users to deploy Cruise Control to various environments and work with the existing metric systems.
Cruise Control provides a metrics reporter that can be configured in your Apache Kafka server. Metrics reporter generates performance metrics to a Kafka metrics topic that can be consumed by Cruise Control.
The Sample Store enables storage of collected metric samples and training samples in an external storage.
Metric sampling uses derived data from the raw metrics, and the accuracy of the derived data depends on the metadata of the cluster at that point. Hence, when we look at the old metrics, if we do not know the metadata at the point the metric was collected, the derived data would not be accurate. Sample Store helps solving this problem by storing the derived data directly to an external storage for later loading.
The default Sample Store implementation produces metric samples back to Kafka.
The goals in Cruise Control are pluggable with different priorities. The default goals in order of decreasing priority are:
kafka_0_11_and_1_0branch) Ensures that Disk space usage of each disk is below a given threshold.
kafka_0_11_and_1_0branch) Attempts to keep the Disk space usage variance among disks within a certain range relative to the average broker Disk utilization.
The anomaly notifier allows users to be notified when an anomaly is detected. Anomalies include:
In addition to anomaly notifications, users can enable actions to be taken in response to an anomaly by turning self-healing on for the relevant anomaly detectors. Multiple anomaly detectors work in harmony using distinct mitigation mechanisms. Their actions broadly fall into the following categories: