Install Spark on Ubuntu (2): Standalone Cluster Mode

In the previous post, I set up Spark in local mode for testing purposes. In this post, I will set up Spark in standalone cluster mode.

1. Prepare VMs

Create 3 identical VMs by following the previous local mode setup (or create 2 more if one already exists). The Spark directory must be at the same location (/usr/local/spark/ in this post) on all nodes.
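
A quick sanity check from the master, using the IP addresses of my two new VMs (introduced in step 2; replace user with your login, and note that ssh prompts for a password until step 2 is done):

$ ssh user@159.89.9.67 ls /usr/local/spark
$ ssh user@167.99.130.96 ls /usr/local/spark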

2. Set up Password-less SSH

Because the master node uses SSH to start the slave nodes, it is convenient to set up password-less SSH. Create a key pair on the master node and copy the public key to each slave node:

$ ssh-keygen -t rsa
$ ssh-copy-id user@159.89.9.67
$ ssh-copy-id user@167.99.130.96

159.89.9.67 and 167.99.130.96 are the IP addresses of my two newly created VMs, and user is a placeholder for the login account on them. By default, the IPs are public and no firewall is set up.
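
If the keys were copied correctly, ssh should now log in without prompting for a password:

$ ssh user@159.89.9.67 hostname
$ ssh user@167.99.130.96 hostname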

3. Configure Master Node

Give the two slave nodes hostnames by appending the following lines to /etc/hosts on the master node:

159.89.9.67 slave1
167.99.130.96 slave2
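
You can verify that the new names resolve; getent consults /etc/hosts as well as DNS:

$ getent hosts slave1 slave2
159.89.9.67     slave1
167.99.130.96   slave2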

Define the slave nodes in /usr/local/spark/conf/slaves. There is a slaves.template file under /usr/local/spark/conf; you can start from it by copying it to "slaves":

$ cd /usr/local/spark/conf
$ cp slaves.template slaves

Add the list of worker hostnames, each on a separate line.

slave1
slave2


When a slave node connects to the master, it needs the master's address and port. By default, Spark binds the master to the machine's hostname, which the slave nodes cannot resolve. Set the master address to its IP address instead so that the slaves can connect.

This is configured in the $SPARK_HOME/conf/spark-env.sh file. There is a spark-env.sh.template file in the same directory; copy it to "spark-env.sh":
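
$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh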

Edit spark-env.sh and set the SPARK_MASTER_HOST variable by appending the following line at the end of the file. Replace the IP address below with your master's IP address, which you can find with the hostname -I command.

SPARK_MASTER_HOST="167.99.130.228"

4. Start Spark Cluster

Start Spark Master.

$ $SPARK_HOME/sbin/start-master.sh

Once the master is started, Spark serves a master web UI. Its default URL is http://your-ip-address:8080.
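
You can also confirm the master JVM is up with jps, which ships with the JDK; a process named Master should be listed (the PIDs below are illustrative):

$ jps
2087 Master
2156 Jps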

Now start the slaves from the master node:

$ $SPARK_HOME/sbin/start-slaves.sh
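
start-slaves.sh logs into each host listed in conf/slaves over SSH and starts a worker there. On each slave, a Worker process should now show up in jps (PIDs are illustrative; this assumes the JDK's jps is on the remote PATH):

$ ssh user@slave1 jps
1924 Worker
1986 Jps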

The master web UI now shows the master and its workers; I have 2 worker nodes as expected.

5. Run Spark Shell in the standalone cluster mode

The standalone cluster is now ready to accept jobs. We can test it using spark-shell.

$ spark-shell --master spark://167.99.130.228:7077

The --master option specifies the master URL of the cluster to connect to. The application web UI (http://167.99.130.228:4040) shows the environment details of the running shell.
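
To confirm that work is actually distributed across the cluster, you can also submit the SparkPi example that ships with Spark. The exact jar name depends on your Spark and Scala versions, hence the glob below:

$ spark-submit --master spark://167.99.130.228:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100

When it finishes, it prints an estimate of pi, and the completed application appears in the master web UI.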

6. Add more slave nodes

If more worker nodes are needed, a new node can be added to the cluster by running:

$ $SPARK_HOME/sbin/start-slave.sh spark://167.99.130.228:7077

The start-slave.sh script starts a worker on the current machine using the given master URL. Run it on the node that is to be added to the cluster (not on the master node, unless you also want the master to act as a slave).

I ran the command on the master node, so the master also became a slave node. The Spark Master UI (http://167.99.130.228:8080) now shows 3 worker nodes in the cluster.
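
When you are done, the cluster can be shut down with the matching stop scripts. stop-all.sh, run on the master, stops the master together with the workers listed in conf/slaves; a worker started manually with start-slave.sh is stopped with stop-slave.sh on its own node.

$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-all.sh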

References:
1. https://spark.apache.org/docs/latest/spark-standalone.html
