A Scalable Implementation of Deep Learning on Spark

This library is based on the implementation of artificial neural networks in Spark ML. In addition to the multilayer perceptron, it contains new Spark deep learning features that have not yet been merged into Spark ML: currently, a Stacked Autoencoder and tensor data flow.

Installation

Requirements

Build

Clone and compile:

git clone https://github.com/avulanov/scalable-deeplearning.git
cd scalable-deeplearning
sbt assembly (or mvn package)

The assembled jar will be available in the target folder; the assembly bundles the optimized numerical processing library netlib-java. Optionally, one can build a jar without bundled dependencies with package.

Performance configuration

Scaladl uses the netlib-java library for optimized numerical processing with native BLAS. All netlib-java classes are included in scaladl.jar. The latter must appear on the classpath before Spark's own libraries, because Spark ships only a subset of netlib. To achieve this, set spark.driver.userClassPathFirst to true in spark-defaults.conf.
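A minimal sketch of the relevant entries in spark-defaults.conf; the executor-side setting spark.executor.userClassPathFirst is an assumption added for completeness and should be verified against your Spark version:

```properties
# Load user jars (including scaladl.jar with its bundled netlib-java)
# before Spark's own classes
spark.driver.userClassPathFirst   true
# Assumed analogous setting for executors; verify for your Spark version
spark.executor.userClassPathFirst true
```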

If native BLAS libraries are not available at runtime, or scaladl is not first on the classpath, you will see the warning WARN BLAS: Failed to load implementation from: and the reference or pure JVM implementation will be used instead. A native BLAS library such as OpenBLAS (libopenblas.so or .dll) or ATLAS (libatlas.so) must be on the library path of every node that runs Spark. Netlib-java requires the library to be named libblas.so.3, so one has to create a symlink; the same applies on Windows with libblas3.dll. Setup details for the different platforms are below. With proper configuration, you will see the message INFO JniLoader: successfully loaded ...netlib-native_system-....

Linux:

Install a native BLAS library (depending on your distribution):

yum install openblas <OR> apt-get install libopenblas-base <OR> download and compile OpenBLAS

Create a symlink to the native BLAS library within its folder /your/blas:

ln -s libopenblas.so libblas.so.3

Add that folder to your library path. Make sure no other folder in the path contains libblas.so.3.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/your/blas
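The symlink-and-path steps above can be sketched end to end. The snippet below uses a temporary directory and a placeholder file instead of a real libopenblas.so, so the paths and the library file are assumptions for illustration only:

```shell
# Demonstrate the naming convention netlib-java expects: libblas.so.3 -> libopenblas.so
BLAS_DIR=$(mktemp -d)                    # stand-in for /your/blas
touch "$BLAS_DIR/libopenblas.so"         # placeholder for the real native library
ln -s libopenblas.so "$BLAS_DIR/libblas.so.3"   # relative symlink, as netlib-java requires
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$BLAS_DIR"
readlink "$BLAS_DIR/libblas.so.3"        # prints: libopenblas.so
```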

Windows:

Copy the following DLLs from the MinGW distribution and from OpenBLAS into a folder named blas. Make sure they are all of the same bitness, either 64-bit or 32-bit. Add that folder to your PATH variable.

libquadmath-0.dll // MINGW
libgcc_s_seh-1.dll // MINGW
libgfortran-3.dll // MINGW
libopenblas.dll // OpenBLAS binary
liblapack3.dll // copy of libopenblas.dll
libblas3.dll // copy of libopenblas.dll

Example of use

Built-in examples

Scaladl provides working examples of MNIST classification and of pre-training with a stacked autoencoder. The examples live in the scaladl.examples package and can be run via spark-submit:

./spark-submit --class scaladl.examples.MnistClassification --master spark://master:7077 /path/to/scaladl.jar /path/to/mnist-libsvm

Spark shell

Start Spark with this library:

./spark-shell --jars scaladl.jar

Or use it as an external dependency of your application.
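As a sketch, the locally built jar can be wired into an sbt build as an unmanaged dependency; the jar path and the Spark version below are assumptions, and the library may not be published to a public Maven repository:

```scala
// build.sbt -- reference the locally assembled jar; the path is an assumption
unmanagedJars in Compile += file("/path/to/scaladl.jar")
// Spark itself is typically a "provided" dependency, supplied by the cluster
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
```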

Multilayer perceptron

MNIST classification

import org.apache.spark.ml.scaladl.MultilayerPerceptronClassifier
val train = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale").persist()
val test = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale.t").persist()
train.count() // materialize the lazily persisted training data
test.count() // materialize the lazily persisted test data
val trainer = new MultilayerPerceptronClassifier().setLayers(Array(784, 32, 10)).setMaxIter(100)
val model = trainer.fit(train)
val result = model.transform(test)
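To check how well the fitted model does on the test set, the result can be scored with Spark ML's standard evaluator. This is a sketch using the stock MulticlassClassificationEvaluator from Spark ML, not anything scaladl-specific, and it assumes the train/test DataFrames from the example above:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Compare the "prediction" column produced by model.transform(test)
// against the "label" column from the test data
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(result)
println(s"Test set accuracy = $accuracy")
```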

Stacked Autoencoder

Pre-training