
kafka-workers

Kafka Workers is a client library that unifies consuming records from Kafka and processing them with user-defined WorkerTasks.

Version

Current version is 1.0.16

Requirements

You need Java 11 and at least Apache Kafka 2.0 to use this library.

Installation

Releases are distributed via the Maven repository:

<dependency>
    <groupId>com.rtbhouse</groupId>
    <artifactId>kafka-workers</artifactId>
    <version>1.0.16</version>
</dependency>

Usage

To use Kafka Workers you should implement the following interfaces:

public interface WorkerTask<K, V> {

    void init(WorkerSubpartition subpartition, WorkersConfig config);

    boolean accept(WorkerRecord<K, V> record);

    void process(WorkerRecord<K, V> record, RecordStatusObserver observer);

    void punctuate(long punctuateTime);

    void close();
}

A WorkerTask is a user-defined task associated with one WorkerSubpartition. Its most important methods are accept() and process(). accept() checks whether the WorkerRecord at the head of the WorkerSubpartition's internal queue can be polled and passed to process(). process() handles the WorkerRecord that has just been polled from that queue. Processing can be synchronous or asynchronous, but in both cases one of the RecordStatusObserver methods, onSuccess() or onFailure(), has to be called. Not calling either of them within a configurable amount of time (consumer.processing.timeout.ms) is treated as a failure.

Additionally, punctuate() allows maintenance work to be done at a configurable interval (punctuator.interval.ms), whether or not there are records to process. The accept(), process() and punctuate() methods are executed sequentially in a single thread, so no synchronization between them is needed. Kafka Workers also synchronizes init() and close() with accept(), process() and punctuate() internally, so these calls need no additional user synchronization either.
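
As a minimal sketch of the observer contract described above, the following self-contained example shows a task that reports the outcome of every record exactly once. `StatusObserver`, `SimpleRecord`, `UppercaseTask` and `CountingObserver` are hypothetical, simplified stand-ins defined here so that the snippet runs without the library on the classpath; a real task would implement `WorkerTask<K, V>` and receive `WorkerRecord` and `RecordStatusObserver` instances instead.

```java
import java.util.ArrayList;
import java.util.List;

public class UppercaseTaskSketch {

    // Hypothetical stand-in for RecordStatusObserver (signatures assumed for illustration).
    interface StatusObserver {
        void onSuccess();
        void onFailure(Exception exception);
    }

    // Hypothetical stand-in for WorkerRecord, reduced to a key and a value.
    static final class SimpleRecord {
        final String key;
        final String value;

        SimpleRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Mirrors the accept()/process() pair of WorkerTask.
    static final class UppercaseTask {
        final List<String> output = new ArrayList<>();

        // Always ready here; a real task could return false while it cannot take the record yet.
        boolean accept(SimpleRecord record) {
            return true;
        }

        // Exactly one observer method is called for every record.
        void process(SimpleRecord record, StatusObserver observer) {
            try {
                output.add(record.value.toUpperCase());
                observer.onSuccess();
            } catch (RuntimeException e) {
                observer.onFailure(e);
            }
        }
    }

    // Counts outcomes so the example can show what was reported.
    static final class CountingObserver implements StatusObserver {
        int successes;
        int failures;

        @Override
        public void onSuccess() {
            successes++;
        }

        @Override
        public void onFailure(Exception exception) {
            failures++;
        }
    }

    public static void main(String[] args) {
        UppercaseTask task = new UppercaseTask();
        CountingObserver observer = new CountingObserver();

        SimpleRecord[] records = {
            new SimpleRecord("k1", "hello"),
            new SimpleRecord("k2", null) // a null value triggers the failure path
        };
        for (SimpleRecord record : records) {
            if (task.accept(record)) {
                task.process(record, observer);
            }
        }
        // prints "[HELLO] successes=1 failures=1"
        System.out.println(task.output + " successes=" + observer.successes
                + " failures=" + observer.failures);
    }
}
```

Wrapping the work in a try/catch is one simple way to guarantee that a record never silently runs into the processing timeout; an asynchronous task would call the observer from its completion callback instead.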

public interface WorkerPartitioner<K, V> {

    int subpartition(ConsumerRecord<K, V> consumerRecord);

    int count(TopicPartition topicPartition);
}

A user-defined partitioner provides additional sub-partitioning, which can distribute processing more evenly. As a result, the stream of records from one TopicPartition may be reordered during processing, but records in the same WorkerSubpartition keep their relative order. It also requires a somewhat more complex offset committing policy, which Kafka Workers provides to ensure at-least-once delivery.
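
To illustrate the ordering guarantee, here is a minimal sketch of key-based sub-partitioning. It is self-contained and does not use the library types: `SubpartitionSketch`, the fixed `SUBPARTITION_COUNT` and the string keys are assumptions made for the example. In a real `WorkerPartitioner`, the same computation would live in `subpartition(ConsumerRecord)` and the count would be returned from `count(TopicPartition)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SubpartitionSketch {

    // Assumed fixed number of subpartitions per topic partition (what count() would return).
    static final int SUBPARTITION_COUNT = 4;

    // Maps a non-null record key to a subpartition index in [0, SUBPARTITION_COUNT).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int subpartition(String key) {
        return Math.floorMod(key.hashCode(), SUBPARTITION_COUNT);
    }

    public static void main(String[] args) {
        String[] keys = {"user-1", "user-2", "user-1", "user-3", "user-2", "user-1"};

        // Group records by subpartition, preserving arrival order within each group.
        Map<Integer, List<String>> queues = new TreeMap<>();
        for (int offset = 0; offset < keys.length; offset++) {
            queues.computeIfAbsent(subpartition(keys[offset]), k -> new ArrayList<>())
                  .add(keys[offset] + "@" + offset);
        }
        queues.forEach((index, records) -> System.out.println(index + " -> " + records));
    }
}
```

Because the index depends only on the key, all records with the same key land in the same queue and keep their relative order, while records with different keys may be processed in parallel.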

Usage example:


    Properties properties = new Properties();
    properties.setProperty("consumer.topics", "my-topic");
    properties.setProperty("consumer.kafka.bootstrap.servers", "localhost:9192");
    properties.setProperty("consumer.kafka.group.id", "my-workers");
    // The deserializers must produce the key and value types declared on KafkaWorkers<String, String> below.
    properties.setProperty("consumer.kafka.key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("consumer.kafka.value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaWorkers<String, String> kafkaWorkers = new KafkaWorkers<>(
        new WorkersConfig(properties),
        new MyWorkerTaskFactory<>(),
        new MyWorkerPartitioner<>(),
        new MyShutdownCallback());

    Runtime.getRuntime().addShutdownHook(new Thread(kafkaWorkers::shutdown));
    kafkaWorkers.start();

Internals

Internally, one Kafka Workers instance launches one consumer thread, one punctuator thread and a configurable number of worker threads. Each worker thread can execute one or more WorkerTasks, and each WorkerTask processes WorkerRecords from the internal queue associated with its WorkerSubpartition. Kafka Workers tracks offset state to ensure that only contiguously processed offsets are committed.
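
The idea of committing only a gap-free run of processed offsets can be sketched as follows. This is an illustration of the principle, not the library's actual offsets state: the class `ContiguousOffsetsSketch` and its methods are hypothetical. It follows the usual Kafka convention that the committed position is the offset of the next record to consume.

```java
import java.util.TreeSet;

public class ContiguousOffsetsSketch {

    // Offsets that finished processing but are not yet covered by the commit position.
    private final TreeSet<Long> processed = new TreeSet<>();

    // First offset that has not been confirmed as processed yet.
    private long nextToCommit;

    ContiguousOffsetsSketch(long firstOffset) {
        this.nextToCommit = firstOffset;
    }

    // Called when a record finishes successfully, possibly out of order.
    void markProcessed(long offset) {
        if (offset >= nextToCommit) {
            processed.add(offset);
        }
    }

    // Advances over the gap-free prefix of processed offsets and returns the position to commit.
    long advance() {
        while (processed.remove(nextToCommit)) {
            nextToCommit++;
        }
        return nextToCommit;
    }

    public static void main(String[] args) {
        ContiguousOffsetsSketch state = new ContiguousOffsetsSketch(0L);
        state.markProcessed(0);
        state.markProcessed(1);
        state.markProcessed(3); // offset 2 is still in flight
        System.out.println(state.advance()); // prints 2

        state.markProcessed(2); // the gap is closed
        System.out.println(state.advance()); // prints 4
    }
}
```

Holding the commit position back at the first gap means that a restart replays every record that was not confirmed, which is what makes at-least-once delivery possible even when subpartitions progress at different speeds.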

Kafka Workers Architecture

Configuration

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| consumer.topics | A list of kafka topics read by ConsumerThread. | list | |
| consumer.commit.interval.ms | The frequency in milliseconds that the processed offsets are committed to Kafka. | long | 10000 |
| consumer.processing.timeout.ms | The timeout in milliseconds for record to be successfully processed. | long | 300000 |
| consumer.poll.timeout.ms | The time in milliseconds spent waiting in poll if data is not available in the buffer. If 0, returns immediately with any records that are available currently in the buffer, else returns empty. | long | 1000 |
| consumer.commit.retries | The number of retries in case of retriable commit failed exception. | int | 3 |
| consumer.kafka | Should be used as a prefix for internal kafka consumer configuration. Usage example: consumer.kafka.bootstrap.servers = localhost:9192; consumer.kafka.group.id = my-workers; consumer.kafka.key.deserializer = org.apache.kafka.common.serialization.BytesDeserializer; consumer.kafka.value.deserializer = org.apache.kafka.common.serialization.BytesDeserializer | | |
| worker.threads.num | The number of WorkerThreads per one Kafka Workers instance. | int | 1 |
| worker.sleep.ms | The time in milliseconds to wait for WorkerThread in case of not accepted tasks. | long | 1000 |
| worker.processing.guarantee | Specifies worker processing guarantees. Possible values: none - logs and skips records which cause processing failure, thus failures don't cause message retransmission and may result in data loss; at_least_once - shuts Kafka Workers down on record processing failure, enforces message retransmission upon restart and may cause data duplication. | String | at_least_once |
| worker.task | Could be used as a prefix for internal task configuration. | | |
| punctuator.interval.ms | The frequency in milliseconds that punctuate method is called. | long | 1000 |
| queue.total.size.heap.ratio | It defines how big part of the heap can be used for input queues (0.5 size of the heap by default). This total memory size for all queues is divided into individual queue sizes. E.g. for 8G heap and 0.5 ratio there will be 4G for all queues. If there are 32 subpartitions each of them will get 128M input queue. Input record sizes are calculated by record.weigher class. | double from (0, 1) range | 0.5 |
| record.weigher | Record weigher class implementing com.rtbhouse.kafka.workers.api.record.weigher.RecordWeigher interface. It measures size in bytes of input records and is used to compute sizes of the input queues. When a size limit for some queue is exceeded then the corresponding kafka partition is paused. | class | com.rtbhouse.kafka.workers.api.record.weigher.SimpleRecordWeigher |
| queue.resume.ratio | The minimum ratio of used to total queue size for partition resuming. | double | 0.9 |
| metric.reporters | A list of classes to use as metrics reporters. Implementing the org.apache.kafka.common.metrics.MetricsReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics. | list | "" |
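
The queue sizing arithmetic described for queue.total.size.heap.ratio can be checked with a short calculation. The sketch below only reproduces the example from the description (8G heap, 0.5 ratio, 32 subpartitions); the class and method names are hypothetical, and the even split across subpartitions is taken from that description rather than from the library code.

```java
public class QueueSizingSketch {

    // Splits the heap share reserved for input queues evenly across all subpartitions.
    static long perQueueBytes(long heapBytes, double heapRatio, int subpartitions) {
        if (heapRatio <= 0.0 || heapRatio >= 1.0) {
            throw new IllegalArgumentException("heapRatio must be in the (0, 1) range");
        }
        if (subpartitions <= 0) {
            throw new IllegalArgumentException("subpartitions must be positive");
        }
        long totalQueueBytes = (long) (heapBytes * heapRatio);
        return totalQueueBytes / subpartitions;
    }

    public static void main(String[] args) {
        long heapBytes = 8L * 1024 * 1024 * 1024; // 8G heap
        long perQueue = perQueueBytes(heapBytes, 0.5, 32);
        System.out.println(perQueue / (1024 * 1024) + "M per queue"); // prints "128M per queue"
    }
}
```

The same calculation shows how the per-queue budget shrinks as subpartitions are added: doubling the subpartition count to 64 halves each queue to 64M for the same heap and ratio.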

Use cases

At RTB House we use Kafka Workers for all components in our processing infrastructure. For more details please check out our techblog posts:

So far we have adopted Kafka Workers for all our use cases: BigQuery, HDFS, Elasticsearch, Aerospike and Postgres streaming writers, as well as other Kafka-to-Kafka data flows that include merging, joining, dispatching, enriching, deduplicating and counting aggregates for streams of events. The diagram below shows the high-level architecture of our current processing infrastructure:

Our real-time data processing