MDS (Multiple Dimension Spread)

Introduction

What does this project do?

MDS (Multiple Dimension Spread) is a schema-less columnar storage format. It provides a flexible representation like JSON and efficient reads similar to other columnar storage formats.
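For illustration, "schema-less" means that records in one input need not share the same fields; a hypothetical JSON-lines input (the file name and the extra "season" field are made up for this sketch) could look like:

```shell
# Hypothetical input: two records with different fields.
# A schema-less format accepts both without a predefined row structure.
cat > /tmp/schemaless_sample.txt <<'EOF'
{"name":"apple","price":110,"class":"fruits"}
{"name":"saury","price":150,"class":"fish","season":"autumn"}
EOF
wc -l < /tmp/schemaless_sample.txt
```

A fixed-schema format would force you to declare (or ignore) the "season" field up front; MDS stores each record as it comes.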

Why is this project useful?

In the big data era, data became too large to compress and store as-is. Driven by the demand for better compression ratios and read performance, several columnar data formats (for example, Apache ORC and Apache Parquet) were proposed. They achieve high compression ratios by grouping similar data within a column, and fast reads by fetching only the columns a query uses.

However, these data formats require that the structure of a row (or record) be defined before the data is saved. This means deciding how the data will be used at storage time, and it is often difficult to know in advance what kind of queries the data will serve.

In this project, we provide a new columnar format that does not require a schema at storage time, with compression and read performance equal to (and in some cases better than) other columnar formats.

Use cases

Data Analysis

Analyzing big data requires storing data compactly and reading it quickly. MDS, as a columnar format, meets both needs.

Data Lake

A data lake is a data pool that does not require a row schema at storage time; instead, the schema of stored data is defined at analysis time. See DataLake.

How do I get started?

First, get the MDS-related repositories as described in the section "How to get the source".

The MDS format can handle data without a Hadoop environment. However, since it targets big data, a Hadoop environment for storage and Hive for reading are needed to use it efficiently.

We plan to provide a Docker environment with Hadoop and Hive for trying MDS out, but for now you need to prepare Hadoop and Hive yourself.

Setup environment

CLI

The CLI is a command-line interface tool for using MDS; the main entry point is mds.sh.

mds.sh needs some jars, so create the jar files before using it:

$ mvn package

How to use

Preparation

For preparation, get the MDS jars and store them in the proper directories.

$ bin/setup.sh # get MDS jars from Maven repository (bin/setup.sh -h for help)

Then put the MDS-related jars onto HDFS:

$ cp -r jars/mds /tmp/mds_lib
$ hdfs dfs -put /tmp/mds_lib /mds_lib

Create MDS formatted file

Convert JSON data to the MDS format:

$ bin/mds.sh create -i src/example/src/main/resources/sample_json.txt -f json -o /tmp/sample.mds
$ bin/mds.sh cat -i /tmp/sample.mds -o '-' # show whole data
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}
$ bin/mds.sh cat -i /tmp/sample.mds -o '-' -p '[ ["name"] ]' # show part of data
{"name":"apple"}
{"name":"orange"}

Copy MDS file to HDFS environment

Copy the MDS file into HDFS:

$ hdfs dfs -mkdir -p /tmp/ss
$ hdfs dfs -put /tmp/sample.mds /tmp/ss/sample.mds

Read MDS file using Hive

Start Hive and add the jar files needed to use the MDS format:

$ hive -i jars/mds/add_jar.hql
> create database test;
> use test;
> create external table sample_json (
    summary struct<total_price: bigint, total_weight: bigint>,
    number bigint,
    price bigint,
    name string,
    class string
  )
  ROW FORMAT SERDE
    'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
  STORED AS INPUTFORMAT
    'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
  OUTPUTFORMAT
    'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
  location '/tmp/ss';
> select * from sample_json;
{"total_price":550,"total_weight":412}  5 110 apple fruits
{"total_price":800,"total_weight":600}  10  80  orange  fruits
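Once the external table is defined, ordinary HiveQL works against it; because MDS is columnar, a query only needs to read the columns it references. A small sketch against the sample table above (standard HiveQL, not an MDS-specific feature):

```sql
-- Reads only the "name", "price", and "class" columns of the MDS file.
select name, price
from sample_json
where class = 'fruits';
```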

See the Hive document for further details.

Where can I get more help, if I need it?

We plan to host support and discussion of MDS on a mailing list; please refer to the subsection "Mailing list" under "How to contribute". Until the mailing list is opened, please contact us via GitHub.

How to contribute

Everyone is welcome to join this project.

Document

See the MDS document.

License

This project is licensed under the Apache License. Please use it under the terms of that license.

Mailing list

User support and discussion of MDS development take place on the following mailing list. To subscribe, please send a blank e-mail to the following address.

The archive is useful for reviewing what has been discussed in this project.

for Developer

Please accept the Contributor License Agreement when participating as a developer.

When you mention it on the mailing list above, we will invite you to JIRA, which we use for bug tracking.

System requirement

The following environments are required.

How to get the source

The MDS library builds jar files from the following modules.

GitHub

The MDS sources are available there.

Preparation

Install gpg and create a gpg key; the Maven build uses it to sign artifacts.

$ gpg --gen-key
$ gpg --list-keys

Add the following gpg settings to your Maven settings.xml (usually $HOME/.m2/settings.xml).

<profiles>
  <profile>
    <id>sign</id>
    <activation>
        <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
        <gpg.passphrase>***YOUR-PASSPHRASE***</gpg.passphrase>
    </properties>
  </profile>
</profiles>

Maven

MDS artifacts can also be obtained from the Maven repository.

multiple-dimension-spread

dataplatform-config

dataplatform-schema-lib

Compile sources

Compile each source tree by following the instructions below.

multiple-dimension-spread

$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/multiple-dimension-spread.git
$ cd multiple-dimension-spread
$ mvn clean install

dataplatform-schema-lib

$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/dataplatform-schema-lib.git
$ cd dataplatform-schema-lib
$ mvn clean install

dataplatform-config

$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/dataplatform-config.git
$ cd dataplatform-config
$ mvn clean install

Next Reading

MISC

Change Logs

FAQ