Cruise Control for Apache Kafka

Introduction

Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies run increasingly large Kafka clusters. At LinkedIn, we have 2.6K+ Kafka brokers, which means broker failures are an almost daily occurrence and balancing the Kafka workload becomes a significant operational overhead.

Kafka Cruise Control is designed to address this operational scalability issue.

Features

Kafka Cruise Control provides the following features out of the box:

Environment Requirements

Quick Start

  1. Get Cruise Control
    1. (Option-1): via git clone
      • git clone https://github.com/linkedin/cruise-control.git && cd cruise-control/
    2. (Option-2): via browsing the available releases:
      • Browse https://github.com/linkedin/cruise-control/releases to pick a release -- e.g. 0.1.10
      • Get and extract the release: wget https://github.com/linkedin/cruise-control/archive/0.1.10.tar.gz && tar zxvf 0.1.10.tar.gz && cd cruise-control-0.1.10/
      • Initialize the local repo: git init && git add . && git commit -m "Init local repo." && git tag -a 0.1.10 -m "Init local version."
  2. This step is required if CruiseControlMetricsReporter is used for metrics collection (i.e. the default for Cruise Control). The metrics reporter periodically samples the Kafka raw metrics on the broker and sends them to a Kafka topic.
    • ./gradlew jar
    • Copy ./cruise-control-metrics-reporter/build/libs/cruise-control-metrics-reporter-A.B.C.jar (where A.B.C is the Cruise Control version) to your Kafka server dependency jar folder. For Apache Kafka, the folder is core/build/dependant-libs-SCALA_VERSION/
    • Modify Kafka server configuration to set metric.reporters to com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter. For Apache Kafka, server properties are located at ./config/server.properties.
    • If SSL is enabled, ensure that the relevant client configurations are properly set for all brokers in ./config/server.properties. Note that CruiseControlMetricsReporter takes all configurations for vanilla KafkaProducer with a prefix of cruise.control.metrics.reporter. -- e.g. cruise.control.metrics.reporter.ssl.truststore.password.
    • If the default broker cleanup policy is compact, make sure that the topic to which Cruise Control metrics reporter should send messages is created with the delete cleanup policy -- the default metrics reporter topic is __CruiseControlMetrics.
  3. Start ZooKeeper and Kafka server (See tutorial).
  4. Modify config/cruisecontrol.properties of Cruise Control:
    • (Required) set bootstrap.servers and zookeeper.connect to point to the Kafka cluster to be monitored.
    • (Optional) set metric.sampler.class to your implementation (the default sampler class is CruiseControlMetricsReporterSampler)
    • (Optional) set sample.store.class to your implementation if you have one (the default SampleStore is KafkaSampleStore)
  5. Run the following commands:
    ./gradlew jar copyDependantLibs
    ./kafka-cruise-control-start.sh [-jars PATH_TO_YOUR_JAR_1,PATH_TO_YOUR_JAR_2] config/cruisecontrol.properties [port]

    JAR files correspond to your applications and port enables customizing the Cruise Control port number (default: 9090).

    • (Note) To emit Cruise Control JMX metrics on a particular port (e.g. 56666), export JMX_PORT=56666 before running kafka-cruise-control-start.sh
  6. (Verify your setup) Visit http://localhost:9090/kafkacruisecontrol/state (or http://localhost:[port]/kafkacruisecontrol/state if you specified the port when starting Cruise Control).
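
The configuration edits in steps 2 and 4 can be scripted. The following is a minimal sketch and not part of Cruise Control: set_prop is a hypothetical helper that sets or replaces a key=value line in a Java-style properties file. It assumes each key sits on a single line with no spaces around the equals sign.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Cruise Control or Kafka):
# set KEY=VALUE in a properties file, replacing an existing entry if present.
set_prop() {
  file=$1
  key=$2
  value=$3
  tmp="$file.tmp.$$"
  # Make sure the file exists so awk has something to read.
  touch "$file"
  # Keep every line that does not start with "KEY=" (plain string match,
  # so the dots in property names are not treated as regex wildcards).
  awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
  # Append the new entry and swap the file into place.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}
```

For example, `set_prop config/server.properties metric.reporters com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter` covers the broker-side change in step 2, and `set_prop config/cruisecontrol.properties bootstrap.servers localhost:9092` (a placeholder address) covers part of step 4. Note that the sketch replaces the whole value, so if metric.reporters already lists other reporters, merge them by hand instead.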

Note:

REST API

Cruise Control provides a REST API for users to interact with. See the wiki page for more details.
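
As a rough illustration, the state endpoint from the Quick Start can be queried from a script. In this sketch, cc_url and the CC_HOST and CC_PORT environment variables are hypothetical names chosen for the example; only the /kafkacruisecontrol/state path and the default port 9090 come from this README.

```shell
#!/bin/sh
# Hypothetical helper: build a Cruise Control REST URL for a given endpoint.
# CC_HOST and CC_PORT are made-up variable names; they default to the
# host and port used in the Quick Start.
cc_url() {
  endpoint=$1
  host=${CC_HOST:-localhost}
  port=${CC_PORT:-9090}
  printf 'http://%s:%s/kafkacruisecontrol/%s\n' "$host" "$port" "$endpoint"
}

# With a running Cruise Control instance, the state can then be fetched with:
#   curl -sS "$(cc_url state)"
```

Keeping the URL construction in one place makes it easy to point the same script at an instance started on a non-default port.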

How Does It Work

Cruise Control relies on the recent load information of replicas to optimize the cluster.

Cruise Control periodically collects resource utilization samples at both broker- and partition-level to infer the traffic pattern of each partition. Based on the traffic characteristics and distribution of all the partitions, it derives the load impact of each partition over the brokers. Cruise Control then builds a workload model to simulate the workload of the Kafka cluster. The goal optimizer explores different ways to generate cluster workload optimization proposals based on the user-specified list of goals.
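
To make the idea of deriving broker load from partition load concrete, here is a toy sketch. It is not Cruise Control's workload model: the input format (broker id, partition name, bytes in per second) and the broker_load helper are invented for illustration, and the sketch only sums per-partition samples into per-broker totals.

```shell
#!/bin/sh
# Illustrative only: sum per-partition load into per-broker totals.
# Each input row is "BROKER_ID PARTITION BYTES_IN_PER_SEC" (a made-up format).
broker_load() {
  awk '{ load[$1] += $3 } END { for (b in load) print b, load[b] }' | sort -n
}

# Example with three hypothetical samples:
#   printf '1 topicA-0 100\n1 topicA-1 50\n2 topicA-2 30\n' | broker_load
# prints:
#   1 150
#   2 30
```

A real model would track several resources per replica and distinguish leaders from followers; the sketch shows only the aggregation step.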

Cruise Control also monitors the liveness of all the brokers in the cluster. To avoid the loss of redundancy, Cruise Control automatically moves replicas from failed brokers to alive ones.

For more details about how Cruise Control achieves that, see these slides.

Configurations for Cruise Control

To read more about the configurations, check the configurations wiki page.

Artifactory

Releases are published at JFrog Artifactory. See available releases.

Pluggable Components

More about pluggable components can be found in the pluggable components wiki page.

Metric Sampler

The metric sampler enables users to deploy Cruise Control to various environments and work with the existing metric systems.

Cruise Control provides a metrics reporter that can be configured in your Apache Kafka server. The metrics reporter publishes performance metrics to a Kafka metrics topic that Cruise Control can consume.

Sample Store

The Sample Store enables storage of collected metric samples and training samples in external storage.

Metric sampling uses data derived from the raw metrics, and the accuracy of the derived data depends on the cluster metadata at that point in time. Hence, when we look at old metrics without knowing the metadata at the time they were collected, the derived data would not be accurate. The Sample Store helps solve this problem by storing the derived data directly in external storage for later loading.

The default Sample Store implementation produces metric samples back to Kafka.

Goals

The goals in Cruise Control are pluggable with different priorities. The default goals in order of decreasing priority are:

Anomaly Notifier

The anomaly notifier allows users to be notified when an anomaly is detected. Anomalies include:

In addition to anomaly notifications, users can enable actions to be taken in response to an anomaly by turning self-healing on for the relevant anomaly detectors. Multiple anomaly detectors work in harmony using distinct mitigation mechanisms. Their actions broadly fall into the following categories: