parquet-flinktacular - How to use Parquet in Flink - Guide

The idea of this tutorial is to get you started as quickly as possible. To that end, I set up a GitHub repository with sample Maven projects that you can use as templates for your own projects.

At the moment I provide templates for the following use cases:

  1. Parquet at Flink - using Java and Protocol Buffers schema definition
  2. Parquet at Flink - using Java and Thrift schema definition
  3. Parquet at Flink - using Java and Avro schema definition
  4. Parquet at Flink - using Scala and Protocol Buffers schema definition

Each project has two main folders: commons and flink.

The commons folder holds your schema definition (IDL) file. The Maven commons/pom.xml is configured to generate classes from the IDL file as part of the build. This makes development more convenient, because you don't need to recompile the IDL file by hand whenever your schema changes.
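
As a rough illustration of this kind of build configuration, here is a minimal sketch of a plugin entry for the Avro case. It is not copied from the template projects. It assumes the schema is an .avsc file under src/main/avro, uses the avro-maven-plugin's schema goal, and relies on an avro.version property that you would have to define yourself. The Protocol Buffers and Thrift templates need their own code-generation plugins instead.

```xml
<!-- Sketch only: goes inside <build><plugins> of commons/pom.xml. -->
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <!-- Assumption: avro.version is defined in the <properties> section of the pom. -->
  <version>${avro.version}</version>
  <executions>
    <execution>
      <!-- Generate Java classes before the compile phase runs. -->
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- Assumption: schema files are kept in src/main/avro. -->
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <!-- Generated sources land under target/ so they are never edited by hand. -->
        <outputDirectory>${project.build.directory}/generated-sources/avro/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With an entry like this in place, a normal Maven build regenerates the Java classes from the schema, so there is no separate manual compiler step. If your schema is written in Avro IDL (.avdl) rather than .avsc, the same plugin provides an idl-protocol goal for that purpose.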

The flink folder contains your Flink jobs, which read and write Parquet.

Choose a template project, download the corresponding folder, and run:

$ mvn clean install package

A more detailed tutorial can be found here :)