TensorFlowOnYARN

TensorFlow on YARN (TOY) is a toolkit that gives Hadoop users an easy way to run TensorFlow applications in distributed mode and to carry out tasks such as model management and inference serving.

Goals

Note that the current project is a prototype with limitations and is still under development.

Architecture

Figure 1. TOY Architecture

Features

Quick Start

  1. Prepare the build environment following the instructions from https://www.tensorflow.org/install/install_sources

  2. Clone the TensorFlowOnYARN repository.

    git clone --recursive https://github.com/Intel-bigdata/TensorFlowOnYARN
  3. Build the assembly.

    cd TensorFlowOnYARN/tensorflow-parent
    mvn package -Pnative -Pdist

    tensorflow-yarn-${VERSION}.tar.gz and tensorflow-yarn-${VERSION}.zip are built in the tensorflow-parent/tensorflow-yarn-dist/target directory. Distribute the assembly to the client node of a YARN cluster and extract it.

  4. Run the between-graph MNIST example.

    cd tensorflow-yarn-${VERSION}
    bin/ydl-tf launch --num_worker 2 --num_ps 2

    This launches a YARN application that creates a tf.train.Server instance for each task. The ClusterSpec is printed to the console so that you can submit the training script to the cluster, one command per worker task, e.g.

    ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}

    python examples/between-graph/mnist_feed.py \
     --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
     --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
     --task_index=0
    
    python examples/between-graph/mnist_feed.py \
     --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
     --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
     --task_index=1
  5. To get the ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application, run:

    bin/ydl-tf cluster --app_id <Application ID>
  6. You may also use YARN commands through ydl-tf.

    For example, to list the running applications,

    bin/ydl-tf application --list

    or to kill an existing YARN application (TensorFlow cluster),

    bin/ydl-tf kill --application <Application ID>
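
In step 4, the host:port pairs from the printed ClusterSpec have to be copied by hand into the --ps_hosts and --worker_hosts flags. The following is a minimal sketch of how that step could be automated. It assumes the console line has the same shape as the sample above (a "ClusterSpec:" prefix followed by JSON with "ps" and "worker" lists). The helper names parse_cluster_spec and worker_commands are hypothetical and are not part of TOY.

```python
import json

# Hypothetical helper, not part of TOY: parse the "ClusterSpec: {...}" line
# that ydl-tf prints, assuming the format shown in the sample above.
def parse_cluster_spec(line):
    line = line.strip()
    prefix = "ClusterSpec:"
    if line.startswith(prefix):
        line = line[len(prefix):]
    return json.loads(line)

# Hypothetical helper: build one training command per worker task, using the
# same flags as the mnist_feed.py invocations in step 4.
def worker_commands(line, script="examples/between-graph/mnist_feed.py"):
    spec = parse_cluster_spec(line)
    ps_hosts = ",".join(spec["ps"])
    worker_hosts = ",".join(spec["worker"])
    return [
        'python {} --ps_hosts="{}" --worker_hosts="{}" --task_index={}'.format(
            script, ps_hosts, worker_hosts, index)
        for index in range(len(spec["worker"]))
    ]

if __name__ == "__main__":
    sample = ('ClusterSpec: {"ps":["node1:22257","node2:22222"],'
              '"worker":["node3:22253","node2:22255"]}')
    for command in worker_commands(sample):
        print(command)
```

Each printed command corresponds to one worker task index, so the number of commands matches the value passed to --num_worker at launch.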