A Scalable Implementation of Deep Learning on Spark

This library is based on the implementation of artificial neural networks in Spark ML. In addition to the multilayer perceptron, it contains new Spark deep learning features that have not yet been merged into Spark ML: currently, a Stacked Autoencoder and tensor data flow.

Installation

Requirements

Build

Clone and compile:

git clone https://github.com/avulanov/scalable-deeplearning.git
cd scalable-deeplearning
sbt assembly (or mvn package)

The assembled jar will be available in the target folder; the assembly bundles the optimized numerical processing library netlib-java. Optionally, one can build a jar without bundled dependencies with package.

Performance configuration

Scaladl uses the netlib-java library for optimized numerical processing with native BLAS. All netlib-java classes are included in scaladl.jar. The latter must appear on the classpath before Spark's own libraries, because Spark ships only a subset of netlib. To achieve this, set spark.driver.userClassPathFirst to true in spark-defaults.conf.
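A minimal sketch of the relevant entries in spark-defaults.conf; the executor-side setting spark.executor.userClassPathFirst is an assumption added for completeness and should be verified against your Spark version:

```properties
# Load user jars (including scaladl.jar with its bundled netlib-java)
# before Spark's own classes
spark.driver.userClassPathFirst   true
# Assumed analogous setting for executors; verify for your Spark version
spark.executor.userClassPathFirst true
```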

If native BLAS libraries are not available at runtime, or scaladl is not first on the classpath, you will see the warning WARN BLAS: Failed to load implementation from: and the reference or pure JVM implementation will be used instead. A native BLAS library such as OpenBLAS (libopenblas.so or .dll) or ATLAS (libatlas.so) must be on the library path of every node that runs Spark. Netlib-java requires the library to be named libblas.so.3, so one has to create a symlink; the same applies on Windows with libblas3.dll. Setup details for the different platforms are below. With proper configuration, you will see the message INFO JniLoader: successfully loaded ...netlib-native_system-....

Linux:

Install a native BLAS library (depending on your distribution):

yum install openblas <OR> apt-get install libopenblas-base <OR> download and compile OpenBLAS

Create a symlink to the native BLAS library within its folder /your/blas:

ln -s libopenblas.so libblas.so.3

Add that folder to your library path. Make sure no other folder in the path contains libblas.so.3.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/your/blas
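The symlink-and-path steps above can be sketched end to end. The snippet below uses a temporary directory and a placeholder file instead of a real libopenblas.so, so the paths and the library file are assumptions for illustration only:

```shell
# Demonstrate the naming convention netlib-java expects: libblas.so.3 -> libopenblas.so
BLAS_DIR=$(mktemp -d)                    # stand-in for /your/blas
touch "$BLAS_DIR/libopenblas.so"         # placeholder for the real native library
ln -s libopenblas.so "$BLAS_DIR/libblas.so.3"   # relative symlink, as netlib-java requires
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$BLAS_DIR"
readlink "$BLAS_DIR/libblas.so.3"        # prints: libopenblas.so
```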

Windows:

Copy the following DLLs from the MinGW distribution and from OpenBLAS into a folder named blas. Make sure they are all of the same bitness, either 64-bit or 32-bit. Add that folder to your PATH variable.

libquadmath-0.dll // MINGW
libgcc_s_seh-1.dll // MINGW
libgfortran-3.dll // MINGW
libopenblas.dll // OpenBLAS binary
liblapack3.dll // copy of libopenblas.dll
libblas3.dll // copy of libopenblas.dll

Example of use

Built-in examples

Scaladl provides working examples of MNIST classification and of pre-training with a stacked autoencoder. The examples live in the scaladl.examples package and can be run via spark-submit:

./spark-submit --class scaladl.examples.MnistClassification --master spark://master:7077 /path/to/scaladl.jar /path/to/mnist-libsvm

Spark shell

Start Spark with this library:

./spark-shell --jars scaladl.jar

Or use it as an external dependency of your application.
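As a sketch, the locally built jar can be wired into an sbt build as an unmanaged dependency; the jar path and the Spark version below are assumptions, and the library may not be published to a public Maven repository:

```scala
// build.sbt -- reference the locally assembled jar; the path is an assumption
unmanagedJars in Compile += file("/path/to/scaladl.jar")
// Spark itself is typically a "provided" dependency, supplied by the cluster
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
```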

Multilayer perceptron

MNIST classification

import org.apache.spark.ml.scaladl.MultilayerPerceptronClassifier
val train = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale").persist()
val test = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale.t").persist()
train.count() // materialize the lazily persisted training data
test.count() // materialize the lazily persisted test data
val trainer = new MultilayerPerceptronClassifier().setLayers(Array(784, 32, 10)).setMaxIter(100)
val model = trainer.fit(train)
val result = model.transform(test)
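To check how well the fitted model does on the test set, the result can be scored with Spark ML's standard evaluator. This is a sketch using the stock MulticlassClassificationEvaluator from Spark ML, not anything scaladl-specific, and it assumes the train/test DataFrames from the example above:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Compare the "prediction" column produced by model.transform(test)
// against the "label" column from the test data
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(result)
println(s"Test set accuracy = $accuracy")
```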

Stacked Autoencoder

Pre-training