Spark cluster deploy tools for Openstack

This project provides scripts that automatically deploy an Apache Spark cluster in any Openstack environment, together with several optional tools.

Our tools do not need prebuilt images; you can just use vanilla ones. Supported distros are listed at the end of this page.

All versions of Apache Spark since 1.0 are supported; you are free to choose the Spark and Hadoop versions you need.

Developed at the Institute for System Programming of the Russian Academy of Sciences and distributed under the Apache 2.0 license.

You are welcome to contribute.

Installation

  1. Install Ansible. The latest updates work well with version 2.8.2.

    It looks like the Openstack modules in Ansible only work with Python 2.7, so if Ansible is already installed for Python 3, use a virtual environment with Python 2 or be careful with your $PATH (the Python 2 Ansible must come first). A minimal virtualenv sketch is given after this list.

    Old versions of packages can cause problems; in that case pip --upgrade could help (e.g. for Ubuntu):

    sudo apt-get install libffi-dev libssl-dev python-dev
    pip install --upgrade pip
    pip install --upgrade six ansible shade

    Also, for some weird reason, six should be installed with easy_install instead of pip on Mac OS in some cases (see the issue on GitHub).

    A sample list of all packages and their versions that is known to work can be found in pip-list.txt. Note: it is the result of pip freeze for a virtualenv; formally speaking we depend only on Ansible, six and shade, and all the other packages are their dependencies.
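
If Ansible for Python 3 is already present on your machine, the Python 2 virtual environment mentioned above can be set up roughly as follows. This is a minimal sketch that assumes the virtualenv tool is installed; the exact version pins are only a suggestion.

virtualenv -p python2.7 ~/spark-openstack-venv     # create an isolated Python 2 environment
source ~/spark-openstack-venv/bin/activate         # activate it for the current shell
pip install --upgrade pip                          # avoid problems caused by an old pip
pip install six shade 'ansible==2.8.2'             # the only direct dependencies of this project
which ansible && ansible --version                 # confirm the Python 2 Ansible comes first in $PATH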

Configuration

  1. Download (unless you have already done so) the <project>-openrc.sh file from your Openstack Web UI (Project > Compute > Access & Security > API Access > Download OpenStack RC File).

    If you don't want to enter the password each time and don't care about security, replace

    read -sr OS_PASSWORD_INPUT
    export OS_PASSWORD=$OS_PASSWORD_INPUT

    with (replace <password> with your password)

    export OS_PASSWORD="<password>"

    WARNING - it's not secure; do not do that.

  2. Before running ./spark-openstack, this file must be sourced (once per shell session):

    source /path/to/your/<project>-openrc.sh
  3. Download or upload a key pair. You'll need both the name of the key pair (the Key Pair Name column in Access & Security > Key Pairs) and the private key file. Make sure that only your user can read the private key file (chmod og= <key-file-name>) and that the private key does not have a passphrase. A short sanity-check sketch is given right after this list.
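
As a quick sanity check before launching anything, you can verify that the RC file is actually sourced and that the key file permissions are right. The OS_* variable names are the standard ones exported by <project>-openrc.sh; the key file name is a placeholder.

env | grep '^OS_'              # the Openstack credentials must be set in the current shell
chmod og= <key-file-name>      # the private key must be readable by your user only
ls -l <key-file-name>          # expect -rw------- (or -r--------)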

Running

Optional goodies

Select Java version to use on the cluster

By default, OpenJDK is used.

If you wish to use Oracle Java, add an optional argument:

 --use-oracle-java
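
For instance, the flag is presumably combined with the usual launch command in the same way as the other optional flags in this document; the placeholders below follow the examples used elsewhere on this page.

./spark-openstack --create -k <key-pair-name> -i <private-key> -s <n-slaves> \
           -t <instance-type> -a <os-image-id> -n <virtual-network-id> -f <floating-ip-pool> \
           --use-oracle-java \
           launch <cluster-name>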

Enabling Openstack Swift support

You may want to use Openstack Swift object storage as a drop-in addition to your HDFS. To enable it, you should specify:

* `--swift-username <username>`: a separate user for accessing Openstack Swift. If you don't specify it, Swift will be
    unavailable.
* `--swift-password <password>`: the password of that separate user. If you don't specify it, Swift
    will be unavailable.

Usage example:

./spark-openstack --create -k borisenko -i /home/al/.ssh/id_rsa -s 10 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --swift-username shared --swift-password password \
           launch borisenko-cluster

Hadoop usage:

hadoop distcp file:///<file> swift://<swift-container-name>.<openstack-project-name>/<path-in-container>
example: hadoop distcp file:///home/ubuntu/test.txt swift://hadoop.computations/test

Spark usage example:

spark-submit --class <class-name> <path-to-jar> swift://hadoop.computations/test

Warning! This option writes swift-username and swift-password into core-site.xml (in two places) as plain text. Use it carefully; it is quite reasonable to create a separate user for Swift.
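
To quickly check that Swift access works from the cluster, you can list and read data through the swift:// scheme with the ordinary Hadoop filesystem commands. This assumes the hadoop-openstack Swift filesystem configured by this option, the same one the distcp example above relies on; the container and project names are the ones from that example.

hadoop fs -ls swift://<swift-container-name>.<openstack-project-name>/    # list a Swift container
hadoop fs -cat swift://hadoop.computations/test                           # read back the file copied with distcp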

Enabling and accessing Jupyter notebook

You may want to use the Jupyter notebook engine. To enable it, use the optional command line parameter:

--deploy-jupyter True

Usage example for deploying Jupyter on an already created cluster:

./spark-openstack -k borisenko -i /home/al/.ssh/id_rsa -s 10 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --deploy-jupyter True \
           launch borisenko-cluster

Jupyter notebook should start automatically after the cluster is deployed. Jupyter may also be deployed when creating a cluster.

Master host IP address can be obtained by running ./spark-openstack get-master <cluster-name>. Alternatively, you can look for lines like "msg": "jupyter install finished on 10.10.17.136 (python_version=3)" in the console output.

Open <master-ip>:8888 in a browser. Using two Spark kernels at the same time won't work, so if you want a different Spark kernel, shut down the other one first!

Manually running Jupyter (e.g. after a cluster restart)

Log in to the master (either get the master IP from the Openstack Web UI or run ./spark-openstack get-master <cluster-name>):

ssh -i <private-key> -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<master-ip>

Make sure Jupyter is not already running (e.g. killall python).

Run the following command on the master:

jupyter notebook --no-browser

Then open <master-ip>:8888 in a browser. Using two Spark kernels at the same time won't work, so if you want a different Spark kernel, shut down the other one first!
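
If port 8888 on the master is not reachable from your machine (for example, it is blocked by security groups), a plain SSH tunnel to the master is a convenient workaround. This is a generic sketch, not something the playbooks configure for you.

# forward local port 8888 to the notebook running on the master
ssh -i <private-key> -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -L 8888:localhost:8888 ubuntu@<master-ip>
# then open http://localhost:8888 in a local browser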

NFS mount

You may want to mount some NFS shares on all the servers in your cluster. To do so, provide the following optional argument:

--nfs-share <share-path> <where-to-mount>

where <share-path> is the address of the NFS share (e.g. 1.1.1.1:/share/) and <where-to-mount> is the path on your cluster machines where the share will be mounted (e.g. /mnt/share).

Usage example on an already created cluster:

./spark-openstack -k borisenko -i /home/al/.ssh/id_rsa -s 10 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --nfs-share 1.1.1.1:/share/ /mnt/share \
           launch borisenko-cluster

Here's a sample of how to access your NFS share in Spark (Scala); in this sample the share is assumed to be mounted at /mnt/nfs_share:

val a = sc.textFile("/mnt/nfs_share/test.txt")
a.count
val parquetFile = sqlContext.read.parquet("/mnt/nfs_share/test.parquet")
parquetFile.count
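
To confirm that the share is actually mounted on the nodes, a simple check over SSH is enough. The mount point below is the one passed to --nfs-share in the usage example above (/mnt/share); adjust it to whatever you used.

# check the NFS mount on the master node
ssh -i <private-key> ubuntu@<master-ip> 'df -h /mnt/share && mount | grep /mnt/share'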

Apache Ignite

You may want to deploy Apache Ignite to the same cluster. To do so, provide the following argument:

--deploy-ignite

Optional arguments include --ignite-memory, used in the example below.

Usage example:

./spark-openstack --create -k borisenko -i /home/al/.ssh/id_rsa -s 10 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --deploy-ignite --ignite-memory 30 \
           launch borisenko-cluster

Ignite configuration is stored in /opt/ignite/config/default-config.xml. Note that Spark applications that use Apache Ignite still need the ignite-spark and ignite-spring packages (e.g. for Apache Ignite version 1.7.0 one can run the Spark shell like this: spark-shell --packages org.apache.ignite:ignite-spark:1.7.0,org.apache.ignite:ignite-spring:1.7.0).
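
The same invocation, spelled out as a command; 1.7.0 is used only because it is the version mentioned above, so match it to the Ignite version actually deployed on your cluster.

# start a Spark shell with the Ignite integration packages on the classpath
spark-shell --packages org.apache.ignite:ignite-spark:1.7.0,org.apache.ignite:ignite-spring:1.7.0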

Loading the default config can be done as follows:

 import org.apache.ignite.spark._
 import org.apache.ignite.configuration._
 val ic = new IgniteContext(sc, "/opt/ignite/config/default-config.xml")

Apache Cassandra

You may want to deploy Apache Cassandra to the cluster. To do so, provide an argument:

--deploy-cassandra

Optionally, you may specify a version to deploy by providing:

--cassandra-version <version>

Cassandra 3.11.0 is deployed by default.

Usage example:

 ./spark-openstack --create -k borisenko -i /home/al/.ssh/id_rsa -s 10 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --deploy-cassandra \
           launch borisenko-cluster
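
A quick way to check that Cassandra is up is nodetool, run on a cluster node (assuming nodetool is on the PATH there). Connecting from Spark typically goes through the spark-cassandra-connector package, which is not necessarily installed by these playbooks, so the coordinates below are only an illustration; pick a connector version that matches your Spark and Cassandra versions.

# on a cluster node: check that the Cassandra ring is up
nodetool status

# example of starting a Spark shell with the DataStax connector
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
    --conf spark.cassandra.connection.host=<master-ip>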

ElasticSearch

You may want to deploy ElasticSearch to the cluster. A role is implemented that installs ElasticSearch 7.1.1 with Open Distro.

You may deploy ElasticSearch by providing:

--deploy-elastic

Currently, the ES cluster is deployed with the default Open Distro configuration.

Usage example:

./spark-openstack --create -k key_name -i /home/al/.ssh/id_rsa -s 5 \
           -t spark.large -a 8ac6a0eb-05c6-40a7-aeb7-551cb87986a2 -n abef0ea-4531-41b9-cba1-442ba1245632 -f public \
           --deploy-elastic \
           launch elastic-cluster

After executing this command, an ES cluster with 1 master and 5 slaves will be created. You may check the cluster by running, on a cluster node:

curl -XGET https://localhost:9200/_cat/nodes?v -u admin:admin --insecure
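
A couple of other standard Elasticsearch endpoints are handy for the same kind of check, again with the default Open Distro admin credentials.

curl -XGET https://localhost:9200/_cluster/health?pretty -u admin:admin --insecure   # overall cluster health
curl -XGET https://localhost:9200/ -u admin:admin --insecure                         # node name and version info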

Additional actions

There is support for on-the-fly setup actions. To use it, run on an existing cluster:

cd spark-openstack
./spark-openstack -k <key-pair-name> -i <private-key> -s <n-slaves> \
   -t <instance-type> -a <os-image-id> -n <virtual-network-id> -f <floating-ip-pool> \
   config <cluster-name> <action>

Supported actions:

Configuring cluster without Floating IPs

Creating a cluster without floating IPs is supported. Since Ansible requires an SSH connection to the cluster nodes, in this case a VM must be created inside the cloud. This can be done using the command:

./spark-openstack -k <key-pair-name> -i <private-key> -t <instance-type> \
    -a <os-image-id> -n <virtual-network-id> -f <floating-ip-pool> \
    runner <name>

After that, a VM will be created with the necessary packages pre-installed and with this project on it, ready for further work. You should copy your <project>-openrc.sh file and SSH key to it; a short scp example is given below.
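
Copying the credentials to the runner VM can be done with plain scp, for example as below; the ubuntu user and the home directory are just the usual defaults for Ubuntu cloud images.

# copy the RC file and the private key to the runner VM
scp -i <private-key> <project>-openrc.sh <private-key> ubuntu@<runner-vm-ip>:~/
# restrict the key permissions on the runner VM as well
ssh -i <private-key> ubuntu@<runner-vm-ip> 'chmod og= ~/<private-key>'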

When you connect to this VM, you can create clusters from it without specifying floating IPs:

./spark-openstack --create -k <key-pair-name> -i <private-key> -s <n-slaves> \
    -t <instance-type> -a <os-image-id> -n <virtual-network-id> \
    launch <cluster-name>

Note that the created cluster will be unavailable from outside the cloud; to make it reachable, you need to configure a floating IP manually.

The VM created for this cluster creation method can be destroyed as usual.

Important notes

Ansible: 2.8.2 and higher.

Python: 2.7.* (3.x should work once the Ansible Openstack modules are fixed)

Python Openstack SDK: 0.31.0

Management machine OS: Mac OS X Yosemite, Linux Mint 17, Kubuntu 14.04, Windows+Cygwin

Guest OS:

Known issues

TODO Roadmap (you are welcome to contribute)