Cloud MLE and GCE compatible TensorFlow distributed training example

Disclaimer: This is not an official Google product.

This example demonstrates how to write distributed TensorFlow code that runs on both Cloud Machine Learning Engine and Google Compute Engine instances.

Note

This example uses deprecated APIs. You are strongly encouraged to refer to newer examples instead of this one.

Products

  • Cloud Machine Learning Engine
  • Google Compute Engine

Prerequisites

  1. A Google Cloud Platform Account
  2. A new Google Cloud Platform Project for this lab with billing enabled
  3. Enable the Cloud Machine Learning Engine API from the API Manager

Do this first

In this section you will start your Google Cloud Shell and clone the application code repository to it.

  1. Open the Cloud Console

  2. Click the Google Cloud Shell icon in the top-right and wait for your shell to open.

  3. List the models to verify that the command returns an empty list:

    $ gcloud ai-platform models list
    Listed 0 items.

    Note: After you start creating models, you can see them listed by using this command.

  4. Clone the lab repository in your Cloud Shell, then change into the cloned directory.

    $ git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example.git
    Cloning into 'cloudml-dist-mnist-example'...
    ...
    
    $ cd cloudml-dist-mnist-example

Train the model on Cloud Machine Learning Engine

  1. Create a bucket used for training jobs.

    $ PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    $ BUCKET="gs://${PROJECT_ID}-ml"
    $ gsutil mb -c regional -l us-central1 $BUCKET
  2. Upload the MNIST dataset to the training bucket.

    $ ./scripts_p3/create_records.py 
    $ gsutil cp /tmp/data/train.tfrecords ${BUCKET}/data/
    $ gsutil cp /tmp/data/test.tfrecords ${BUCKET}/data/

Note: The dataset is stored in the TFRecord format.
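
To make the format concrete, here is a minimal sketch of how one MNIST example can be serialized into a TFRecord file. It is illustrative only: scripts_p3/create_records.py is the authoritative converter, and the feature names ('image_raw', 'label') are assumptions.

    # Illustrative sketch only -- scripts_p3/create_records.py is the real
    # converter; the feature names below are assumptions.
    import tensorflow as tf

    def make_example(image_bytes, label):
        # Wrap the raw 28x28 image bytes and the integer label in a tf.train.Example.
        return tf.train.Example(features=tf.train.Features(feature={
            'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))

    with tf.io.TFRecordWriter('/tmp/data/example.tfrecords') as writer:
        writer.write(make_example(b'\x00' * 784, 7).SerializeToString())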

  3. Submit a training job to Cloud Machine Learning Engine.

    $ JOB_ID="${USER}_$(date +%Y%m%d_%H%M%S)"
    $ gcloud ai-platform jobs submit training ${JOB_ID} \
        --package-path trainer \
        --module-name trainer.task_p3 \
        --staging-bucket ${BUCKET} \
        --job-dir ${BUCKET}/${JOB_ID} \
        --runtime-version 1.14 \
        --python-version 3.5 \
        --region us-central1 \
        --config config/config.yaml \
        -- \
        --data_dir ${BUCKET}/data \
        --output_dir ${BUCKET}/${JOB_ID} \
        --train_steps 10000

    Note: JOB_ID can be arbitrary, but each job must be given a new, unique ID.

    Note: Edit config/config.yaml to change the resources allocated to the job; a sample layout is sketched below.
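
    For reference, config/config.yaml follows the AI Platform training-input schema. The layout below is only a sketch, and the actual values in the repository may differ.

    trainingInput:
      scaleTier: CUSTOM
      masterType: standard
      workerType: standard
      parameterServerType: standard
      workerCount: 2
      parameterServerCount: 1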

    During training, the worker nodes report the training loss (the total loss for a single training batch) at regular intervals. In addition, the master node reports the loss and accuracy on the test set about every 3 minutes.

    At the end of the training, the final evaluation against the test set is printed as shown below. In this example, it achieved 99.3% accuracy on the test set.

    Saving dict for global step 10008: accuracy = 0.9931, global_step = 10008, loss = 0.0315906
  4. (Optional) Visualize the training process with TensorBoard

    After the training, the summary data is stored in ${BUCKET}/${JOB_ID} and you can visualize it with TensorBoard. First, run the following command in the Cloud Shell to start TensorBoard.

    $ tensorboard --port 8080 --logdir ${BUCKET}/${JOB_ID}

    Select 'Preview on port 8080' from the Web Preview menu in the top-left corner to open a new browser window.

    In the new window, you can use TensorBoard to explore the training summaries and the visualized network graph.

Deploy the trained model for predictions

  1. Deploy the trained model and set the default version.

    $ MODEL_NAME=MNIST
    $ ORIGIN=$(gsutil ls ${BUCKET}/${JOB_ID}/export/exporter | tail -1)
    $ gcloud ai-platform models create ${MODEL_NAME} --regions us-central1
    $ VERSION_NAME=v1
    $ gcloud ai-platform versions create \
      --origin ${ORIGIN} \
      --model ${MODEL_NAME} \
      ${VERSION_NAME}
    $ gcloud ai-platform versions set-default --model ${MODEL_NAME} ${VERSION_NAME}

    Note: MODEL_NAME and VERSION_NAME can be arbitrary, but you can't reuse the same names. It may take a few minutes for the deployed model to become ready; until it does, requests return a 503 error.

  2. Create a JSON request file.

    $ ./scripts_p3/make_request.py

    This creates a JSON file request.json containing 10 test instances for prediction. Each line contains an MNIST image and a sequential key value.
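
    For illustration, one line of request.json can be generated roughly as follows. The field names ('image', 'key') are assumptions mirroring the serving signature; scripts_p3/make_request.py is the authoritative generator.

    # Sketch: build one request line; field names are assumptions.
    import json

    line = json.dumps({
        'image': [0.0] * 784,  # flattened 28x28 pixel values
        'key': 0,              # sequential key echoed back with the prediction
    })
    print(line)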

  3. Submit an online prediction request.

    $ gcloud ai-platform predict --model ${MODEL_NAME} --json-instances request.json
    CLASSES  PROBABILITIES
    7        [3.437006127094938e-21, 5.562060376991084e-16, 2.5538862785511466e-19, 7.567420805782991e-17, 2.891652426709158e-16, 2.2750016241705544e-20, 1.837758172149778e-24, 1.0, 6.893573298530907e-19, 8.065571390565747e-15]
    2        [1.2471907477623206e-23, 2.291396136267388e-25, 1.0, 1.294716955176118e-32, 3.952643278911311e-25, 3.526924652059716e-36, 3.607279481567486e-25, 1.8093850397574458e-30, 7.008172489249426e-26, 2.6986217649454554e-29]
    1        [5.45423318868473e-15, 1.0, 9.504595027687301e-12, 6.393277101537779e-16, 2.4266970655162368e-09, 1.674065400192326e-11, 5.571797797448985e-12, 1.3436474155525957e-10, 9.206201773137579e-11, 1.1845845552843799e-14]
    0        [1.0, 5.909986790816939e-19, 2.4125963203678984e-14, 6.702774370873354e-18, 1.6427204513119395e-14, 1.2845496049501432e-15, 1.942619459391337e-12, 1.5014800112869015e-13, 4.630940880876074e-16, 7.722024408796102e-12]
    4        [1.683408525966247e-15, 3.426583061416047e-15, 2.352944702683872e-14, 1.4354134652804144e-17, 1.0, 4.901479816097308e-14, 1.764131375703304e-13, 4.879935744842132e-16, 3.459843631210824e-13, 1.0766989513508185e-12]
    1        [1.6325680202852679e-15, 1.0, 1.65708375840512e-13, 1.280228757537301e-17, 5.987414036789929e-11, 1.184729804494175e-13, 3.078233265616133e-14, 2.1106190406516845e-10, 2.1799059085614303e-11, 3.1170367766660944e-15]
    4        [3.183327881731347e-25, 1.5531472253595656e-14, 8.822675326266091e-17, 1.1257204467406693e-23, 1.0, 3.675780366256499e-15, 2.509253215360222e-18, 1.948070958357668e-15, 7.695419301789741e-10, 1.5495283617957163e-14]
    9        [7.252868244881265e-20, 1.0792899474931612e-19, 3.0990492379377613e-16, 1.9092543500607597e-16, 3.1547468630488407e-12, 3.947477936670459e-16, 7.344393142182707e-25, 1.0551019585358889e-17, 7.17420805922131e-13, 1.0]
    5        [6.770833673326029e-15, 2.2513067020667885e-17, 4.9216548513263305e-19, 1.441009507560458e-17, 4.733751198914959e-15, 0.9999998807907104, 6.701989718749246e-08, 5.521230514480798e-19, 5.276084458216701e-09, 4.1348588442069456e-10]
    9        [5.124952379488745e-22, 1.917571388490136e-20, 2.02434602684524e-21, 2.1246177460406675e-18, 1.8790316524963657e-11, 2.7904309518969085e-14, 7.973171243464317e-26, 6.233734909559877e-14, 9.224547341257772e-12, 1.0]

    CLASSES is the most probable digit for the given image, and PROBABILITIES shows the probability of each digit.
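
    For instance, the first row's class follows directly from its probability vector, as this small check illustrates (probabilities rounded from the output above):

    # CLASSES is the argmax over PROBABILITIES: digit 7 has probability 1.0.
    probabilities = [3.4e-21, 5.6e-16, 2.6e-19, 7.6e-17, 2.9e-16,
                     2.3e-20, 1.8e-24, 1.0, 6.9e-19, 8.1e-15]
    predicted = max(range(10), key=lambda d: probabilities[d])
    assert predicted == 7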

Using online prediction from Datalab

You can use a Datalab notebook to try the online prediction feature interactively.

  1. Launch Datalab from the Cloud Shell.

    $ datalab create mydatalab --zone us-central1-a
    ...
    Click on the *Web Preview* (up-arrow button at top-left), select *port 8081*, and start using Datalab.
  2. Select 'Preview on port 8081' from the Web Preview menu in the top-left corner to open a Datalab window.

  3. Open a new notebook and execute the following command.

    !git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example
  4. Go back to the notebook list window and open Online prediction example.ipynb in the cloudml-dist-mnist-example/notebooks folder.

  5. Follow the instructions in the notebook.
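
For reference, the kind of online-prediction call such a notebook makes can be sketched with the google-api-python-client library. This is an assumption about the approach; the notebook itself is the authoritative example, and 'your-project-id' is a placeholder.

    # Sketch of an online prediction request against the deployed model.
    # The instance fields are assumptions mirroring request.json.
    from googleapiclient import discovery

    service = discovery.build('ml', 'v1')
    name = 'projects/your-project-id/models/MNIST'
    body = {'instances': [{'image': [0.0] * 784, 'key': 0}]}
    response = service.projects().predict(name=name, body=body).execute()
    print(response['predictions'])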

(Option) Training on VM instances

Optionally, you can train the model using VM instances running on Google Compute Engine (GCE).

  1. Launch four VM instances with the following options.

    • Hostname: ps-0, master-0, worker-0, worker-1
    • OS image: ubuntu-1604-lts
    • Machine type: n1-standard-1
    • Identity and API access: Set access for each API, Storage = 'Read Write'

    Note: Since instance roles are inferred from their hostnames, you must set hostnames exactly as specified.
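
    The hostnames matter because a CMLE-compatible trainer reads its cluster layout from the TF_CONFIG environment variable: Cloud ML Engine sets it automatically, while on GCE the start script has to assemble it from the instance names. Here is a sketch of what one node might receive (ports are assumptions):

    # Sketch of a TF_CONFIG value; on GCE it is derived from the hostnames.
    import json, os

    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {
            'ps':     ['ps-0:2222'],
            'master': ['master-0:2222'],
            'worker': ['worker-0:2222', 'worker-1:2222'],
        },
        'task': {'type': 'worker', 'index': 0},  # differs per instance
    })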

  2. Install TensorFlow

    Open an SSH terminal to each instance and run the following commands on all of them.

    $ sudo apt-get update
    $ sudo apt-get -y upgrade
    $ sudo apt-get install -y python3-pip
    $ sudo pip3 install tensorflow==1.14.0 gast==0.2.2
  3. Create a bucket used for training jobs and upload MNIST dataset.

    Run the following commands in the Cloud Shell.

    $ git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example.git
    $ cd cloudml-dist-mnist-example
    $ PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    $ BUCKET="gs://${PROJECT_ID}-ml"
    $ gsutil mb $BUCKET
    $ ./scripts_p3/create_records.py 
    $ gsutil cp /tmp/data/train.tfrecords ${BUCKET}/data/
    $ gsutil cp /tmp/data/test.tfrecords ${BUCKET}/data/
  4. Start training

    Run the following commands in the Cloud Shell. This distributes the executable files to all instances and starts a training job.

    $ gcloud config set compute/zone us-east1-c
    $ ./scripts_p3/start-training.sh $BUCKET

    Note: us-east1-c must match the zone in which you created the instances.

    When the training is finished, the storage path containing the model binary will be displayed as shown below.

    Trained model is stored in gs://cloudml-sample-ml/job_170415_081023/export/exporter/1492245350557/

Clean up

Clean up is really easy, but also super important: if you don't follow these instructions, you will continue to be billed for the project you created.

To clean up, navigate to the Google Developers Console Project List, choose the project you created for this lab, and delete it. That's it.