TransmogrifAI Hello World for SBT

First, download Spark 2.4.5.

Define the SPARK_HOME environment variable:

export SPARK_HOME=your_spark_home_dir
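
Before running the examples, it can help to confirm that SPARK_HOME points at a usable Spark installation. The following is a minimal sketch, not part of the project: the function name check_spark_home is hypothetical, and it assumes the standard Spark layout where spark-submit lives under bin/.

```shell
# Hypothetical helper (not part of TransmogrifAI): sanity-check SPARK_HOME.
# Assumes the usual Spark layout with an executable bin/spark-submit.
check_spark_home() {
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
        echo "No executable spark-submit under $SPARK_HOME/bin" >&2
        return 1
    fi
    echo "Using Spark at $SPARK_HOME"
}
```

Calling check_spark_home before any of the ./sbt commands below gives an early, readable failure instead of an error from inside the build.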

Run OpTitanicSimple:

./sbt "sparkSubmit \
    --class com.salesforce.hw.OpTitanicSimple \
    -- $PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"

Titanic model

Train

./sbt "sparkSubmit \
    --class com.salesforce.hw.titanic.OpTitanic -- \
    --run-type=train --model-location=/tmp/titanic-model \
    --read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"

Score

./sbt "sparkSubmit \
    --class com.salesforce.hw.titanic.OpTitanic -- \
    --run-type=score --model-location=/tmp/titanic-model \
    --read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv \
    --write-location /tmp/titanic-scores"

Evaluate

./sbt "sparkSubmit \
    --class com.salesforce.hw.titanic.OpTitanic -- \
    --run-type evaluate \
    --model-location /tmp/titanic-model \
    --read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv \
    --write-location /tmp/titanic-eval \
    --metrics-location /tmp/titanic-metrics"

Boston house model

Train

./sbt "sparkSubmit \
    --class com.salesforce.hw.boston.OpBoston -- \
    --run-type=train --model-location=/tmp/boston-model \
    --read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data"

Score

./sbt "sparkSubmit \
    --class com.salesforce.hw.boston.OpBoston -- \
    --run-type=score --model-location=/tmp/boston-model \
    --read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data \
    --write-location /tmp/boston-scores"

Evaluate

./sbt "sparkSubmit \
    --class com.salesforce.hw.boston.OpBoston -- \
    --run-type evaluate \
    --model-location /tmp/boston-model \
    --read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data \
    --write-location /tmp/boston-eval \
    --metrics-location /tmp/boston-metrics"

Iris model

Train

./sbt "sparkSubmit \
    --class com.salesforce.hw.iris.OpIris -- \
    --run-type=train --model-location=/tmp/iris-model \
    --read-location Iris=$PWD/src/main/resources/IrisDataset/iris.data"

Score

./sbt "sparkSubmit \
    --class com.salesforce.hw.iris.OpIris -- \
    --run-type=score --model-location=/tmp/iris-model \
    --read-location Iris=$PWD/src/main/resources/IrisDataset/bezdekIris.data \
    --write-location /tmp/iris-scores"

Evaluate

./sbt "sparkSubmit \
    --class com.salesforce.hw.iris.OpIris -- \
    --run-type evaluate \
    --model-location /tmp/iris-model \
    --read-location Iris=$PWD/src/main/resources/IrisDataset/bezdekIris.data \
    --write-location /tmp/iris-eval \
    --metrics-location /tmp/iris-metrics"
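
The train, score, and evaluate commands above differ only in the class, the reader key, and the paths. As a rough sketch, the argument string could be assembled by a small helper. The function name op_args and its parameter order are hypothetical; it uses only the flags shown above and assumes that none of the paths contain spaces.

```shell
# Hypothetical helper: print the sbt argument string for one run.
# Usage: op_args CLASS RUN_TYPE MODEL_DIR READER_KEY DATA_PATH [OUT_PREFIX]
# Assumes paths without spaces, as in the commands above.
op_args() {
    class=$1 run_type=$2 model=$3 key=$4 data=$5 out=${6:-}
    args="sparkSubmit --class $class -- --run-type=$run_type"
    args="$args --model-location=$model --read-location $key=$data"
    case $run_type in
        # score writes predictions; evaluate also writes metrics
        score)    args="$args --write-location $out-scores" ;;
        evaluate) args="$args --write-location $out-eval --metrics-location $out-metrics" ;;
    esac
    printf '%s\n' "$args"
}
```

For example, ./sbt "$(op_args com.salesforce.hw.iris.OpIris train /tmp/iris-model Iris $PWD/src/main/resources/IrisDataset/iris.data)" would correspond to the Iris train command.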

Data Preparation

./sbt "sparkSubmit \
    --class com.salesforce.hw.dataprep.JoinsAndAggregates -- \
    $PWD/src/main/resources/EmailDataset/Clicks.csv \
    $PWD/src/main/resources/EmailDataset/Sends.csv"

./sbt "sparkSubmit \
    --class com.salesforce.hw.dataprep.ConditionalAggregation -- \
    $PWD/src/main/resources/WebVisitsDataset/WebVisits.csv"

Verify the Results

Look for the output file(s) in the location you specified. For instance, you can use avro-tools to inspect the scores files (on macOS, run brew install avro-tools to install it).
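
As an illustration of inspecting the scores, the sketch below prints the first few records of each Avro file in an output directory using the avro-tools tojson command. The function names and the AVRO_TOOLS override variable are hypothetical, and the sketch assumes the scores are written as files ending in .avro; adjust the pattern if your output is named differently.

```shell
# Hypothetical helpers for peeking at Avro output; not part of the project.
# Assumes output files end in .avro and that avro-tools is on the PATH
# (set AVRO_TOOLS to point at a different command if needed).
list_avro_files() {
    find "$1" -type f -name '*.avro' | sort
}

# Usage: show_scores DIR [MAX_RECORDS_PER_FILE]
show_scores() {
    dir=$1
    limit=${2:-5}
    list_avro_files "$dir" | while IFS= read -r f; do
        echo "== $f"
        # tojson prints the file's records as JSON
        ${AVRO_TOOLS:-avro-tools} tojson "$f" | head -n "$limit"
    done
}
```

For the Titanic example, show_scores /tmp/titanic-scores would print up to five records per file.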

Other than that, the best way to verify the results is to look through the logs generated during the run. They contain detailed information about the features, the processing, and the model's reliability.

Generate your own workflow

Experiment with adding feature changes or exploring more models in any of the provided workflows.

See how high you can get your auROC!