Nemo

A Data Processing System for Flexible Employment With Different Deployment Characteristics.

Online Documentation

Details about Nemo and its development can be found in:

Please refer to the Contribution guideline to contribute to our project.

Nemo prerequisites and setup

Prerequisites

Installing Nemo

Running Beam applications

Apache Nemo is an official runner of Apache Beam. It can be executed from Beam using NemoRunner, or directly from the Nemo project. Details on using NemoRunner from Beam are given on the NemoRunner page of the Apache Beam website. The following describes how to run Beam applications directly on Nemo.

Configurable options

Examples

## WordCount example from the Beam website (Count words from a document)
$ ./bin/run_beam.sh \
    -job_id beam_wordcount \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.DefaultPolicy \
    -user_main org.apache.nemo.examples.beam.BeamWordCount \
    -user_args "--runner=NemoRunner --inputFile=`pwd`/examples/resources/inputs/test_input_wordcount --output=`pwd`/outputs/wordcount"
$ less `pwd`/outputs/wordcount*

## MapReduce WordCount example (Count words from the Wikipedia dataset)
$ ./bin/run_beam.sh \
    -job_id mr_default \
    -executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.DefaultPolicy \
    -user_main org.apache.nemo.examples.beam.WordCount \
    -user_args "`pwd`/examples/resources/inputs/test_input_wordcount `pwd`/outputs/wordcount"
$ less `pwd`/outputs/wordcount*

## YARN cluster example
$ ./bin/run_beam.sh \
    -deploy_mode yarn \
    -job_id mr_transient \
    -executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
    -user_main org.apache.nemo.examples.beam.WordCount \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
    -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"

## NEXMark streaming Q0 (query0) example
$ ./bin/run_nexmark.sh \
    -job_id nexmark-Q0 \
    -executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
    -user_main org.apache.beam.sdk.nexmark.Main \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.StreamingPolicy \
    -scheduler_impl_class_name org.apache.nemo.runtime.master.scheduler.StreamingScheduler \
    -user_args "--runner=NemoRunner --streaming=true --query=0 --numEventGenerators=1"
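
The WordCount invocation above hard-codes its input and output paths. As a minimal sketch, the command line can be assembled by a small helper and printed for inspection before it is run. The helper name wordcount_cmd is hypothetical and not part of Nemo; every flag it emits is copied from the examples above. It assumes paths without spaces.

```shell
#!/usr/bin/env bash
# wordcount_cmd is a hypothetical helper: it prints (does not run) a
# run_beam.sh command line for the Beam WordCount example.
# All flags are taken verbatim from the README examples.
wordcount_cmd() {
  local input="$1" output="$2"
  # Each argument is printed followed by a single space.
  printf '%s ' ./bin/run_beam.sh \
    -job_id beam_wordcount \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.DefaultPolicy \
    -user_main org.apache.nemo.examples.beam.BeamWordCount
  # "--" stops printf from treating the leading "-" as an option.
  printf -- '-user_args "--runner=NemoRunner --inputFile=%s --output=%s"\n' \
    "$input" "$output"
}

# Print the command for the bundled test input.
wordcount_cmd "$(pwd)/examples/resources/inputs/test_input_wordcount" \
              "$(pwd)/outputs/wordcount"
```

Printing the command first makes it easy to check the quoting of -user_args before launching a job.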

Resource Configuration

The -executor_json command-line option provides the path to a JSON file that describes the resource configuration for executors. Its default value is config/default.json, which initializes one executor of each type (Transient, Reserved, and Compute), each with one core and 1024 MB of memory.

Configurable options

Examples

[
  {
    "num": 12,
    "type": "Transient",
    "memory_mb": 1024,
    "capacity": 4
  },
  {
    "type": "Reserved",
    "memory_mb": 1024,
    "capacity": 2
  }
]

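This example configuration specifies 12 Transient executors, each with 1024 MB of memory and a capacity of 4, and a Reserved executor entry with 1024 MB of memory, a capacity of 2, and no explicit num.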
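
A minimal shell sketch of preparing such a file: it writes the example configuration to a temporary path and performs a rough check of how many executor entries it contains. The function names are hypothetical, and the check simply counts "type" keys rather than parsing JSON.

```shell
#!/usr/bin/env bash
# write_executor_json is a hypothetical helper that writes the example
# resource configuration from this README to the given path.
write_executor_json() {
  local out="$1"
  cat > "$out" <<'EOF'
[
  {
    "num": 12,
    "type": "Transient",
    "memory_mb": 1024,
    "capacity": 4
  },
  {
    "type": "Reserved",
    "memory_mb": 1024,
    "capacity": 2
  }
]
EOF
}

# Rough sanity check: count entries by counting "type" keys.
# This is not a JSON parser; it only catches obvious mistakes.
count_executor_entries() {
  grep -c '"type"' "$1"
}

conf="$(mktemp)"
write_executor_json "$conf"
echo "wrote $conf with $(count_executor_entries "$conf") executor entries"
# The file can then be passed to a job via: -executor_json "$conf"
```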

Monitoring your job using Web UI

Please refer to the instructions at web-ui/README.md to run the frontend.

Visualizing metrics at runtime

While the Nemo driver is alive, it can post runtime metrics through a WebSocket. In your frontend, add the WebSocket endpoint

ws://<DRIVER>:10101/api/websocket

where <DRIVER> is the hostname of the machine on which the Nemo driver runs.

Alternatively, you can run the Web UI directly on the driver using bin/run_webserver.sh. It looks for the WebSocket on its local machine and, by default, serves the UI at

http://<DRIVER>:3333
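
As a small sketch, both addresses can be derived from the driver hostname so they do not have to be typed by hand. The helper names and the DRIVER_HOST variable are hypothetical; the ports and path are the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that build the two addresses from a driver hostname.
# Port 10101 and /api/websocket, and port 3333, are the values shown above.
nemo_ws_endpoint() {
  printf 'ws://%s:10101/api/websocket\n' "$1"
}

nemo_webui_url() {
  printf 'http://%s:3333\n' "$1"
}

# DRIVER_HOST is an assumed environment variable; it falls back to localhost.
driver_host="${DRIVER_HOST:-localhost}"
nemo_ws_endpoint "$driver_host"
nemo_webui_url "$driver_host"
```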

Post-job analysis

On job completion, the Nemo driver creates metric.json in the directory specified by the -dag_dir option. In your frontend, add the JSON file to do post-job analysis.

Other JSON files are for the legacy Web UI, which uses Graphviz to visualize IR DAGs and execution plans.

Examples

$ ./bin/run_beam.sh \
    -job_id als \
    -executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
    -user_main org.apache.nemo.examples.beam.AlternatingLeastSquare \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
    -dag_dir "./dag/als" \
    -user_args "`pwd`/examples/resources/inputs/test_input_als 10 3"
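
After a run like the one above, a quick check can confirm that metric.json was written before loading it into the frontend. This is a minimal sketch; find_metric_json is a hypothetical helper, and it assumes only that the file is named metric.json and sits directly in the -dag_dir directory, as described above.

```shell
#!/usr/bin/env bash
# find_metric_json is a hypothetical helper: it prints the path of
# metric.json inside the given -dag_dir directory, or returns 1 with a
# message on stderr if the file is not there.
find_metric_json() {
  local dag_dir="$1"
  local metric_file="$dag_dir/metric.json"
  if [ -f "$metric_file" ]; then
    printf '%s\n' "$metric_file"
  else
    echo "metric.json not found in $dag_dir (has the job completed?)" >&2
    return 1
  fi
}

# Check the directory used by the ALS example; do not abort if it is missing.
find_metric_json "./dag/als" || true
```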

Options for writing metric results to databases.

Speeding up builds