
About Comet Data Pipeline

Complete documentation available here

Introduction

The purpose of this project is to efficiently ingest various data sources in different formats and make them available for analytics. Usually, ingestion is done by writing hand-made custom parsers that transform input files into datasets of records.

This project aims to automate this parsing task by making data ingestion purely declarative.
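
To give a feel for what "purely declarative" means, here is a minimal, hypothetical sketch: the expected attributes are described as plain data rather than code, and a single generic routine uses that description to accept or reject incoming records. The `Attribute`, `RecordResult` and `ingest` names are illustrative assumptions, not part of the Comet API.

```scala
// Hypothetical illustration of declarative ingestion: the schema is plain data,
// and one generic routine uses it to parse and validate any CSV input.
final case class Attribute(name: String, required: Boolean = true, pattern: String = ".*")

final case class RecordResult(accepted: List[Map[String, String]], rejected: List[String])

object DeclarativeIngestion {
  // Validate each CSV line against the declared attributes instead of a hand-made parser.
  def ingest(lines: Seq[String], schema: Seq[Attribute]): RecordResult = {
    val (ok, ko) = lines.partition { line =>
      val fields = line.split(",", -1).map(_.trim)
      fields.length == schema.length &&
        schema.zip(fields).forall { case (attr, value) =>
          (value.nonEmpty || !attr.required) && value.matches(attr.pattern)
        }
    }
    RecordResult(
      accepted = ok.toList.map(l => schema.map(_.name).zip(l.split(",", -1).map(_.trim)).toMap),
      rejected = ko.toList
    )
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq(Attribute("id", pattern = "\\d+"), Attribute("name"), Attribute("email", required = false))
    val result = ingest(Seq("1,Alice,alice@acme.com", "x,Bob,", "2,Carol,"), schema)
    println(s"accepted=${result.accepted.size}, rejected=${result.rejected.size}") // accepted=2, rejected=1
  }
}
```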

The workflow below is a typical use case:

The main advantages of the Comet Data Pipeline project are:

How it works

Comet Data Pipeline automates the loading and parsing of files and their ingestion into a Hadoop data lake, where datasets become available as Hive tables.

Complete Comet Data Pipeline

  1. Landing Area: files are first stored in the local file system.
  2. Staging Area: files associated with a schema are imported into the data lake.
  3. Working Area: staged files are parsed against their schema; records are rejected or accepted and made available as Parquet/ORC/... files exposed as Hive tables (see the sketch after this list).
  4. Business Area: tables in the Working Area may be joined to provide a holistic view of the data through the definition of an AutoJob.
  5. Data visualization: Parquet/ORC/... tables may be exposed in data warehouses or Elasticsearch indexes.
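
As a rough sketch of steps 2 and 3 (not Comet's actual code), the Spark snippet below reads a staged CSV file with a declared schema, splits records into accepted and rejected sets, writes the accepted ones as a Parquet-backed Hive table and the rejected ones to a separate location. Paths, schema fields and table names are made-up assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical sketch of the Staging -> Working step: paths, schema and table
// names are illustrative, not Comet's actual configuration.
object StagingToWorking {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("staging-to-working-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Declared schema for the staged file (step 2).
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("email", StringType, nullable = true)
    ))

    val staged = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/datalake/staging/sales/customers.csv") // hypothetical staging path

    // Step 3: accept records that satisfy the declared constraints, reject the rest.
    val accepted = staged.filter(col("id").isNotNull && col("name").isNotNull)
    val rejected = staged.exceptAll(accepted)

    accepted.write.mode("overwrite").format("parquet").saveAsTable("sales.customers")
    rejected.write.mode("overwrite").parquet("/datalake/rejected/sales/customers")

    spark.stop()
  }
}
```

In the actual pipeline, these steps are driven entirely by the declarative schema files rather than by hand-written Spark code like the sketch above.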