ETL Light

Light and effective Extract-Transform-Load job based on Apache Spark, tailored for use cases where the source is Kafka and the sink is a Hadoop file system implementation such as HDFS, Amazon S3, or the local FS (useful for testing).

Features:

Overview

A single job goes through the following steps:

Pipeline, Writers and Transformers:

A Pipeline (yamrcraft.etlite.pipeline.Pipeline) is a processor of an individual event as extracted from the Kafka source (given as a raw byte array). It is a composition of a Transformer instance and a Writer instance, where the output of the Transformer is the input of the Writer.

A Pipeline instance is created by an implementation of yamrcraft.etlite.pipeline.PipelineFactory.

A PipelineFactory class must be configured in the job's provided configuration file (application.conf).

There are several implementations of the PipelineFactory trait:

Different compositions of Transformers and Writers can be built into Pipelines as needed.
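
As a rough illustration of how these pieces fit together, the sketch below shows simplified, hypothetical shapes of the Transformer, Writer, Pipeline and PipelineFactory abstractions (illustrative only; the real traits in yamrcraft.etlite.pipeline may differ):

    import com.typesafe.config.Config

    // Hypothetical, simplified shapes - not the project's exact signatures.
    trait Transformer[T] {
      def transform(topic: String, event: Array[Byte]): T
    }

    trait Writer[T] {
      def write(record: T): Unit
      def commit(): Unit
    }

    // A Pipeline feeds each transformed event into the Writer.
    class Pipeline[T](transformer: Transformer[T], writer: Writer[T]) {
      def processMessage(topic: String, event: Array[Byte]): Unit =
        writer.write(transformer.transform(topic, event))
    }

    // A PipelineFactory assembles a Transformer/Writer pair into a Pipeline.
    trait PipelineFactory[T] {
      def createPipeline(config: Config): Pipeline[T]
    }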

Configuration

See configuration examples under: etlite/src/main/resources

spark.conf: properties used to directly configure Spark settings, passed to SparkConf upon construction.

kafka.topics: a list of Kafka topics to be consumed by this job run.

kafka.conf: properties used to directly configure Kafka settings, passed to KafkaUtils.createRDD(...).

etl.lock: can be used to set a distributed lock to prevent the same job from running concurrently, avoiding possible contention or corruption of data.

etl.state: sets the destination folder that holds the state files and the number of last state files to keep.

etl.errors-folder: location of the errors folder that holds events that failed processing (e.g. due to parse errors).

etl.pipeline: defines a transformer and writer pair used together to create a pipeline for processing a single Kafka topic partition.

etl.pipeline.transformer.config: transformer configurations.

etl.pipeline.writer.config: writer configurations.
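
Putting these keys together, a skeletal application.conf could look roughly as follows (an illustrative sketch only - values are placeholders, and the exact contents of the lock/state/transformer/writer sections should be taken from the examples under etlite/src/main/resources):

    # illustrative skeleton - see etlite/src/main/resources for complete examples
    spark.conf {
      # properties passed as-is to SparkConf, e.g. spark.app.name
    }

    kafka.topics = ["events"]             # topics consumed by this job run

    kafka.conf {
      # properties passed to KafkaUtils.createRDD(...), e.g. broker addresses
    }

    etl {
      lock {
        # distributed lock settings (prevent concurrent runs of the same job)
      }
      state {
        # state folder and number of last state files to keep
      }
      errors-folder = "/data/etl/errors"  # illustrative path
      pipeline {
        factory-class = "yamrcraft.etlite.pipeline.GenericProtoPipelineFactory"
        transformer.config {
          # transformer settings
        }
        writer.config {
          # writer settings
        }
      }
    }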

Build

The following generates an uber jar ready to be run as a Spark job (specific use cases might require extending it):

$ sbt etlite/assembly  

Test

Integration (black-box) tests can be found under the 'etlite' project: src/it.

Each integration test generally goes through these steps:

  1. docker-compose up - starts zookeeper, kafka and spark containers.
  2. Ingests events relevant to the specific test case into Kafka (e.g. JSON, protobuf).
  3. Runs the ETL Spark job inside the spark container; the job to run is determined by the provided configuration file (application.conf).
  4. Asserts the ETL job's run artifacts.
  5. docker-compose down.

To run the integration tests, first create the etlite uber jar and then run the tests:

$ sbt etlite/assembly           # creates etlite assembly jar
$ sbt proto-messages/assembly   # creates an external protobuf messages jar used by the protobuf integration test

$ sbt etlite/it:test            # runs all integration tests under src/it using dockers for kafka/spark

Run

Copy the generated 'etlite' assembly jar and your application.conf file to the target Spark machine and plug them into the following commands:

(*) Note: to read the configuration file from the local file system, prefix its path with "file:///"

$ spark-submit <assembly jar> <application.conf>

Run in yarn-client mode (running the driver locally):

$ spark-submit --master yarn-client <assembly jar> <application.conf>

Run in yarn-cluster mode (running the driver in the YARN application master):

$ spark-submit --master yarn-cluster <assembly jar> <application.conf>
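
For instance, in yarn-client mode with the assembly jar in the current directory and the configuration file read from the local file system (both paths below are illustrative):

    $ spark-submit --master yarn-client etlite-assembly.jar file:///etc/etlite/application.conf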

Examples

Protobuf to Parquet ETL

Off-the-shelf support for transforming and loading protobuf events - requires only custom configuration and adding the referenced protobuf events jar to the Spark classpath.

(*) Assumes each Kafka topic holds messages of a single protobuf schema, which is a good practice anyway.

In order to run a job that reads protobuf-serialized messages from Kafka, you will need to do the following:

  1. Set the relevant configuration entries in the application.conf file as follows:

1.1 Set the pipeline factory class to be used, in this case the protobuf pipeline factory:

factory-class = "yamrcraft.etlite.pipeline.GenericProtoPipelineFactory"

1.2 Set the mapping from your topic name to the fully qualified name of the protobuf event class. In this example the topic 'events' is expected to hold protobuf messages that can be deserialized into an object of class 'examples.protobuf.UserOuterClass$User', which must exist in the classpath (see below how to configure it).

1.3 Set timestamp-field to specify a field of protobuf Timestamp type in the protobuf event; it is later used by the writer to resolve the storage partition.

    transformer = {
      config = {
        timestamp-field = "time"
        topic-to-proto-class {
          "events" = "examples.protobuf.UserOuterClass$User"
        }
      }
    }
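
For reference, a protobuf definition matching the configuration above might look like the following (an assumed illustration, not the actual examples.protobuf source of this repository):

    // user.proto - illustrative only
    syntax = "proto3";

    package examples.protobuf;

    import "google/protobuf/timestamp.proto";

    // java_outer_classname yields the examples.protobuf.UserOuterClass$User
    // class referenced in the configuration above.
    option java_outer_classname = "UserOuterClass";

    message User {
      string name = 1;                     // illustrative payload field
      google.protobuf.Timestamp time = 2;  // referenced by timestamp-field = "time"
    }
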
  2. Run the job and add your external protobuf messages jar (the jar that contains your protobuf-generated classes) to the Spark classpath, for example:

    $ spark-submit --jars <path to external protobuf messages jar> <assembly jar> <application.conf>

See a running example under 'etlite/src/it/scala/yamrcraft/etlite/proto/ProtobufETLIntegrationTest'.