dbis-ilm/piglet

piglet-master
- src
  - main
    - resources
      - piglet.conf
      - logback.xml
    - scala
      - dbis
        piglet
        expr
        Ref.scala
        package.scala
        Traverser.scala
        Predicate.scala
        Expression.scala
        ArithmeticExpr.scala
        udf
        UDF.scala
        ScalaUDFParser.scala
        backends
        BackendManager.scala
        PigletREPL.scala
        Piglet.scala
        plan
        DataflowPlan.scala
        PrettyPrinter.scala
        PlanMerger.scala
        rewriting
        Extractors.scala
        Rules.scala
        internals
        WindowSupport.scala
        package.scala
        FastStrategyAdder.scala
        EmbedSupport.scala
        FilterUtils.scala
        MutingSupport.scala
        Fixers.scala
        RDF.scala
        StrategyBuilders.scala
        ProfilingSupport.scala
        MaterializationSupport.scala
        DFPSupport.scala
        Rewriter.scala
        Functions.scala
        dsl
        package.scala
        words
        ReplaceWord.scala
        ImmediateEndWord.scala
        CheckWord.scala
        MergeWord.scala
        WhenWord.scala
        builders
        MergeBuilder.scala
        ReplacementBuilder.scala
        PigOperatorBuilder.scala
        traits
        CheckWordT.scala
        BuilderT.scala
        EndWordT.scala
        RewriterDSL.scala
        rulesets
        SparkRuleset.scala
        RDFRuleset.scala
        GeneralRuleset.scala
        Ruleset.scala
        PipeNameGenerator.scala
        api
        PigletInterpreterAPI.scala
        op
        SplitInto.scala
        Delay.scala
        Pipe.scala
        Difference.scala
        SocketWrite.scala
        MacroOp.scala
        Zip.scala
        StreamOp.scala
        Visualize.scala
        Top.scala
        Display.scala
        TimingOp.scala
        Empty.scala
        RDFLoad.scala
        Matcher.scala
        PigOperator.scala
        Window.scala
        Materialize.scala
        Partition.scala
        Limit.scala
        SpatialJoin.scala
        ConstructBag.scala
        Sample.scala
        Tuplify.scala
        Accumulate.scala
        Intersection.scala
        Foreach.scala
        Describe.scala
        Join.scala
        IndexOp.scala
        Filter.scala
        WindowApply.scala
        Store.scala
        Cache.scala
        Distinct.scala
        cmd
        RegisterCmd.scala
        HdfsCmd.scala
        SetCmd.scala
        DefineCmd.scala
        EmbedCmd.scala
        CoGroup.scala
        DefineMacroCmd.scala
        SpatialFilter.scala
        Union.scala
        BGPFilter.scala
        Cross.scala
        Load.scala
        SocketRead.scala
        Grouping.scala
        OrderBy.scala
        Generate.scala
        RScript.scala
        Dump.scala
        codegen
        scala_lang
        FilterEmitter.scala
        JoinEmitter.scala
        UnionEmitter.scala
        DumpEmitter.scala
        LimitEmitter.scala
        ZipEmitter.scala
        CrossEmitter.scala
        CoGroupEmitter.scala
        DistinctEmitter.scala
        SampleEmitter.scala
        OrderByEmitter.scala
        TimingEmitter.scala
        HdfsCmdEmitter.scala
        TopEmitter.scala
        StreamOpEmitter.scala
        StoreEmitter.scala
        DifferenceEmitter.scala
        ScalaCodeGenStrategy.scala
        IntersectionEmitter.scala
        EmptyEmitter.scala
        ScalaEmitter.scala
        AccumulateEmitter.scala
        GroupingEmitter.scala
        LoadEmitter.scala
        MatcherEmitter.scala
        ForeachEmitter.scala
        flink
        FlinkHelper.scala
        FlinkStreamingCodeGenStrategy.scala
        FlinkCodeGenStrategy.scala
        emitter
        JoinEmitter.scala
        StreamFilterEmitter.scala
        DumpEmitter.scala
        LimitEmitter.scala
        StreamWindowEmitter.scala
        SocketReadEmitter.scala
        StreamCrossEmitter.scala
        StreamDumpEmitter.scala
        StreamLoadEmitter.scala
        StreamForeachEmitter.scala
        OrderByEmitter.scala
        StreamWindowApplyEmitter.scala
        StreamOpEmitter.scala
        StreamSampleEmitter.scala
        StoreEmitter.scala
        StreamJoinEmitter.scala
        StreamDistinctEmitter.scala
        StreamOrderByEmitter.scala
        AccumulateEmitter.scala
        StreamStoreEmitter.scala
        GroupingEmitter.scala
        LoadEmitter.scala
        StreamGroupingEmitter.scala
        SocketWriteEmitter.scala
        StreamAccumulateEmitter.scala
        PigletCompiler.scala
        CodeEmitter.scala
        spark
        SpatialIndexEmitter.scala
        SpatialFilterEmitter.scala
        SpatialJoinEmitter.scala
        StreamWindowEmitter.scala
        StreamDumpEmitter.scala
        StreamLoadEmitter.scala
        CacheEmitter.scala
        DelayEmitter.scala
        SparkStreamingCodeGenStrategy.scala
        StreamDistinctEmitter.scala
        SparkCodeGenStrategy.scala
        SpatialEmitterHelper.scala
        VisualizeEmitter.scala
        StreamOrderByEmitter.scala
        StreamStoreEmitter.scala
        PartitionerEmitter.scala
        StreamGroupingEmitter.scala
        StreamSocketReadEmitter.scala
        CodeGenerator.scala
        CodeGenContext.scala
        tools
        JarBuilder.scala
        PlanWriter.scala
        Conf.scala
        RingBuffer.scala
        UpdateMap.scala
        FileTools.scala
        ScalaCompiler.scala
        Walker.scala
        ProductTools.scala
        CliParams.scala
        CppCompiler.scala
        schema
        Schema.scala
        ComplexTypes.scala
        Types.scala
        parser
        PigParser.scala
        mm
        GlobalOperatorGraph.scala
        MaterializationManager.scala
        DataflowProfiler.scala
        MaterializationPoint.scala
        StatServer.scala
        CacheEntry.scala
        CacheManager.scala
        Eviction.scala
  - it
    - resources
      - json.pig
      - windowJoin.pig
      - union.pig
      - selfjoin.pig
      - two_joins.pig
      - skyline.data
      - groupall.pig
      - cross.pig
      - truth
        splitX.data
        sampling.data
        simple-matrix-res.data
        joined.data
        sorted.data
        unique.data
        accumulate.data
        filtered.data
        rdf_pathjoin_plain.data
        macro1.data
        joined_ambiguous_fieldnames.data
        bag.data
        construct.data
        joined_filtered.data
        aggregate.data
        bgpfilter.data
        cluster.data
        grouping2.data
        marycount.data
        crossmany.csv
        aggregate2.data
        rdf.data
        grouping.data
        aggrwogrouping.data
        result2.data
        rdf_starjoin_plain.data
        cross2.csv
        result1.data
        groupall.data
        distances.data
        nested.data
        top.data
        splitY.data
        splitZ.data
        united.data
        embedded.data
        groupedrdf.data
        cross3.csv
        sorted_multiple_directions.data
        twojoins.data
        spatialjoin.data
        spatialfilter.data
        jdbc-data.data
        result3.data
        crossed.data
      - spatialfilterwithindex.pig
      - aggrwogrouping.pig
      - windowGrouping.pig
      - spatialpartitioning.pig
      - macro1.pig
      - windowFilter.pig
      - splitInto.pig
      - spatialjoin.pig
      - grouping2.pig
      - simple_matrix.pig
      - aggregate.pig
      - jdbc.pig
      - selfjoin_ambiguous_fieldnames.pig
      - skyline.pig
      - streaming
        union.pig
        aggregate.pig
        accumulate.pig
        construct.pig
      - spatialjoinwithindex.pig
      - crossmany.pig
      - rdf_starjoin_plain.pig
      - top_schema.pig
      - foreach1.pig
      - stream_foreach1.pig
      - socket_read.pig
      - grouping.pig
      - wordcount.pig
      - groupforeach.pig
      - bgpfilter.pig
      - filter.pig
      - rdf_pathjoin_plain.pig
      - import2.pig
      - nforeach.pig
      - embedded.pig
      - accumulate.pig
      - load3.pig
      - rscript.pig
      - json.data
      - construct.pig
      - nforeach2.pig
      - stream_load2.pig
      - stream_filter.pig
      - spatialfilter.pig
      - bag.pig
      - load2.pig
      - windowCount.pig
      - windowNforeach.pig
      - import1.pig
      - load.pig
      - windowCross.pig
      - stream_load.pig
      - windowSort.pig
      - selfjoin_filtered.pig
      - sort.pig
      - socket_write.pig
      - input
        aggregate.csv
        file.csv
        grouping.txt
        events.csv
        file.json
        unsorted.csv
        sibdataset.nt
        test.mv.db
        split.csv
        cluster-data.csv
        joinInput.csv
        mary.txt
        unsorted_top.csv
        file.txt
        duplicates.csv
        construct.csv
        rdf-data.nt
        rdf-data-grouped.data
        nested.csv
        matrix_data.csv
      - sort_multiple_directions.pig
      - windowDistinct.pig
      - top.pig
      - sampling.pig
    - scala
      - dbis
        test
        CompileIt.scala
        flink
        FlinkCompileIt.scala
        FlinksCompileIt.scala
        spark
        SparksCompileIt.scala
        SparkCompileIt.scala
  - test
    - resources
      - piglet.conf
      - logback-test.xml
      - cluster-data.csv
    - scala
      - dbis
        piglet
        CompilerSpec.scala
        MaterializeTest.scala
        udf
        UDFSpec.scala
        SchemaSpec.scala
        HDFSSpec.scala
        RIntegrationSpec.scala
        DataflowPlanSpec.scala
        op
        PigOperatorSpec.scala
        PigParserSpec.scala
        codegen
        flink
        FlinkStreamingCompileSpec.scala
        FlinkCompileSpec.scala
        spark
        SparkCompileSpec.scala
        SparkStreamingCompileSpec.scala
        ExprSpec.scala
        tools
        RingBufferSpec.scala
        TestTools.scala
        CodeMatchers.scala
        CodeMatcherSpec.scala
        RDFSpec.scala
        RewriterSpec.scala
        TypeSpec.scala
        mm
        DataflowProfilerSpec.scala
        CacheManagerSpec.scala
        KnapsackSpec.scala
        LRUSpec.scala
        GlobalOperatorGraphSpec.scala
- script
  - piglet
  - statserver.py
  - simplestatserver.sh
- Language.md
- flinklib
  - src
    - main
      - resources
        application.conf
        log4j.properties
        flinks-template.stg
        flink-template.stg
      - scala
        dbis
        piglet
        backends
        flink
        FlinkConf.scala
        streaming
        FlinksConf.scala
        StreamFuncs.scala
        FlinkExtensions.scala
        SinkFuncs.scala
        SourceFuncs.scala
        UTF8StringSchema.scala
        PigFuncs.scala
        FlinkRun.scala
        Storage.scala
    - test
      - resources
        logback-test.xml
      - scala
        dbis
        flink
        test
        StorageSpec.scala
        FuncsSpec.scala
  - build.sbt
  - .gitignore
- ceplib
  - src
    - main
      - scala
        dbis
        piglet
        cep
        engines
        NextMatch.scala
        FirstMatch.scala
        CEPEngine.scala
        ContiguityMatch.scala
        AnyMatch.scala
        nfa
        NFAState.scala
        NFAStructure.scala
        RelatedValue.scala
        NFAEdge.scala
        NFAController.scala
        flink
        CustomDataSetMatcher.scala
        DataSetMatcher.scala
        CustomDataStreamMatcher.scala
        DataStreamMatcher.scala
        spark
        DStreamMatcher.scala
        CustomRDDMatcher.scala
        CustomDStreamMatcher.scala
        RDDMatcher.scala
        ops
        Outputter.scala
        Strategies.scala
        MatchCollector.scala
        EngineConf.scala
    - test
      - resources
        logback-test.xml
      - scala
        dbis
        cep
        test
        flink
        FlinkStreamingCEPTest.scala
        FlinkBatchCEPTest.scala
        spark
        SparkBatchCEPTest.scala
  - build.sbt
  - .gitignore
- LICENSE
- lib_unmanaged
  - jvmr_2.11-2.11.2.1.jar
- common
  - src
    - main
      - scala
        dbis
        piglet
        backends
        CppConfig.scala
        BackendConf.scala
        SchemaClass.scala
        PigletBackend.scala
        CommonPigFuncs.scala
        tools
        HdfsCommand.scala
        HDFSService.scala
        logging
        PigletLogging.scala
        PigletLogger.scala
  - build.sbt
  - .gitignore
- project
  - Dependencies.scala
  - build.properties
  - assembly.sbt
  - plugins.sbt
- make-distribution.sh
- Zeppelin.md
- README.md
- backends.md
- mapreducelib
  - src
    - main
      - scala
        dbis
        piglet
        backends
        mapreduce
        PigRun.scala
  - build.sbt
  - .gitignore
- Dockerfile
- sparklib
  - src
    - main
      - resources
        application.conf
        spark-template.stg
        sparks-template.stg
      - scala
        dbis
        piglet
        backends
        spark
        PigFuncs.scala
        PerfMonitor.scala
        SparkRun.scala
        SparkStream.scala
        Storage.scala
        FileStreamReader.scala
        RScriptOp.scala
        SparkSRun.scala
    - test
      - resources
        logback-test.xml
        person.csv
        values.csv
      - scala
        dbis
        piglet
        backends
        spark
        StorageSpec.scala
        Person.scala
        FuncsSpec.scala
  - build.sbt
  - .gitignore
- build.sbt
- .gitignore
- setm
  - src
    - main
      - scala
        dbis
        setm
        Main.scala
        SETM.scala
  - test
  - project
    - assembly.sbt
  - README.md
  - build.sbt
  - .gitignore
- .dockerignore
- zeppelin
  - src
    - main
      - java
        dbis
        piglet
        PigletInterpreter.java
  - build.sbt
  - .gitignore
- materialization_scripts
  - weather_high_wind.pig
  - gdelt_diff_tone.pig
  - weather_blizzard.pig
  - taxi_high_tip_block.pig
  - gdelt_gold_tone_roi.pig
  - taxi_tip_avg.pig
  - taxi_total_pick_is_drop.pig
  - taxi_dist_duration.pig
  - gdelt_url_eventcode.pig
  - weather_avg_temp_roi.pig

Pig Latin Compiler for Apache Spark / Flink

The goal of this project is to build a compiler for the Pig Latin dataflow language on modern data analytics platforms such as Apache Spark and Apache Flink. The project is not intented as a replacement or competitor of the official Pig compiler for Hadoop or its extensions such as PigSpork. Instead we have the following goals:

We want to build a compiler from scratch that compiles natively to the Scala-based Spark/Flink API and avoids all the stuff needed for MapReduce/Hadoop.
Though, we are aiming at being compatible to the original Pig compiler we plan to integrate extensiblity features allowing to define and use user-defined operators (not only UDFs) and in this way being able to integrate extensions for graph processing or machine learning.

Installation

Clone & Update

Simply clone the git project.

Build

To build the project, in the project directory invoke

sbt package

This will build the (main) Pig compiler project as well as the shipped backends. (i.e. sparklib and flinklib)

There are several test cases included which should be passed: unit tests can be executed by sbt test, integration tests which compile and execute Pig scripts on Spark or Flink are executed by sbt it:test.

Note that building the compiler requires the most recent Spark and Flink jars, but they will be downloaded by sbt automatically.

If you want to use the compiler with the frontend scripts (see below), you have to build an assembly:

sbt assembly

Usage

We provide a simple wrapper script for processing Pig scripts. Just call it with

piglet --master local[4] --backend spark your_script.pig

To run this script you have to specify the full path to the platform distribution jar in the environment variable SPARK_JAR for Spark (e.g. spark-assembly-1.5.2-hadoop2.6.0.jar) and in FLINK_JAR (e.g. flink-dist_2.11-1.0.0.jar) for Flink. For Flink you also have to provide the path to the conf directory in FLINK_CONF_DIR.

An example for Spark could look like the following:

export SPARK_JAR=/opt/spark-1.6.0/assembly/target/scala-2.11/spark-assembly-1.6.0-hadoop2.6.0.jar
piglet --master local[4] --backend spark your_script.pig

The equivalent for Flink would be:

export FLINK_JAR=/opt/flink-1.0.0/build-target/lib/flink-dist_2.11-1.0.0.jar
export FLINK_CONF_DIR=/opt/flink-1.0.0/build-target/conf
piglet --master local[4] --backend flink your_script.pig

Note, that both for Spark and Flink you need a version built for Scala 2.11 (see e.g. Spark doc and Flink doc) and the same version used for building must also be used for execution. For Flink you have to run the start script found in the bin directory (e.g. /opt/flink-1.0.0/build-target/bin/start-local.sh) before executing scripts.

The following options are supported:

--master m specifies the master (local, yarn-client, yarn)
--compile compile and build jar, but do not execute
--profiling
--outdir dir specifies the output directory for the generated code
--backend b specifies the backend to execute the script. Currently, we support
- spark (Apache Spark in batch mode)
- sparks: Apache Spark Streaming
- flink: Apache Flink in batch mode
- flinks: Apache Flink Streaming
- mapreduce: Apache Hadoop (by simply passing the script to the original Pig compiler)
--backend_dir dir
--params key=value, ...
--update-config
--show-plan Print the resulting dataflow plan
--show-stats Show execution runtimes for (some) Piglet methods
--keep Keep generated files
--sequential If more than one input script is provided, do not merge them but execute them sequentially
--log-level l
--backend-args key=value, ...

In addition, you can start an interactive Pig shell similar to Grunt:

piglet --interactive --backend spark

where Pig statements can be entered at the prompt and are executed as soon as a DUMP or STORE statement is entered. Furthermore, the schema can be printed using DESCRIBE.

Docker

Piglet can also be run as a Docker container. However, the image is not yet on DockerHub, so it has to be built manually:

sbt clean package assembly
docker build -t dbis/piglet .

Currently, the Docker image supports the Spark backend only.

To start the container, run:

docker run -it --rm --name piglet dbis/piglet

This uses the container's entrypoint which runs piglet. The above command will print the help message.

You can start the interactive mode, using -i option and enter your script.

docker run -it --rm --name piglet dbis/piglet -b spark -i

Alternatively, you can add your existing files into the container by mounting volumes and run the script in batch mode:

docker run -it --rm --name piglet -v /tmp/test.pig:/test.pig dbis/piglet -b spark /test.pig

As mentioned before, the container provides an entrypoint that executes piglet. In case you need a bash for that container, you need to overwrite the entrypoint:

docker run -it --rm --name piglet --entrypoint /bin/bash dbis/piglet

Configuration

To configure the program, we ship a configuration file. When starting the program for the first time, we will create our program home directory in your home directory and also copy the configuration file into this directory. More specifically, we will create a folder ~/.piglet (on *nix like systems) and copy the configuration file application.conf to this location.

If you update Piglet to a new version and the configuration file still exists from a previous version, a configuration exception might occur because we cannot find new configuration keys introduced by the new Piglet version in the existing config file. In such cases, you can start piglet with the -u (--update-config) option. This will force the override of your old configuration (make sure you have a backup if needed). Alternatively, you can simply remove the existing ~/.piglet/application.conf. This will also trigger the copy routine.

We use the Typesafe Config library.

Backends

As stated before, we support various backends that are used to execute the scripts. You can add your own backend by creating a jar file that contains the necessary configuration information and classes and adding it to the classpath (e.g. using the BACKEND_DIR variable).

More detailed information on how to create backends can be found in backends.md

Further Information

Details on the supported language features (statements, functions, etc.) are described here.
Documentation on how to setup integration with Zeppelin.
We use the Scala testing framework as well as the scoverage tool for test coverage. You can produce a coverage report by running sbt clean coverage test. The results can be found in target/scala-2.11/scoverage-report/index.html.