Install Spark on Ubuntu (1): Local Mode

This post shows how to set up Spark in local mode. Everything runs on a single machine, without any cluster manager (YARN or Mesos). The purpose is to quickly set up Spark for trying things out.

1. Prepare a VM

Start a brand new VM with a cloud service provider, such as AWS, Linode, DigitalOcean, etc. There are several reasons to use a VM. First, Spark is a big data platform that will eventually be set up in the cloud, and a VM in the cloud can be extended to a fully functional cluster in the future. Second, you don't want to mess up your local machine; the VM can be destroyed once you are done with the practice. My preference for this post is DigitalOcean, as it is very easy to start with. AWS EC2 is not a bad choice, but the number of concepts (security groups, EFS, EBS, etc.) may overwhelm new learners.

Create a user and grant it sudo permission (run these commands as root):

$ adduser spark
$ usermod -aG sudo spark

2. Install Spark

Download the latest release of Spark from https://spark.apache.org/downloads.html. This post uses Spark 2.4.0 pre-built for Hadoop 2.7; adjust the file names below if you choose a different version.

Switch to the spark user and download the package:

$ su spark
$ wget http://apache.claz.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

Extract the archive:

$ tar -xvf spark-2.4.0-bin-hadoop2.7.tgz

Move the Spark folder to /usr/local/ and create a symbolic link so that you can install and switch to another version in the future.

$ sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/
$ sudo ln -s /usr/local/spark-2.4.0-bin-hadoop2.7/ /usr/local/spark

Define the SPARK_HOME environment variable. Add the following lines to the end of the .profile file:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

And source the .profile file.

$ . ~/.profile

3. Install Java

Spark needs a Java runtime (Java 8 for Spark 2.4), so this step is required if Java is not already installed. This post uses the Oracle Java 8 PPA; if that PPA is unavailable, OpenJDK 8 (sudo apt install openjdk-8-jdk) works as well, with JAVA_HOME set to /usr/lib/jvm/java-8-openjdk-amd64.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt update
$ sudo apt install oracle-java8-installer

Add the following line to the end of the .profile file:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

And source .profile again.

4. Start Spark Shell

Since we have added /usr/local/spark/bin to PATH, spark-shell can be started by running:

$ spark-shell

You can now play with Spark RDDs, DataFrames, etc.
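
As a starting point, here is a minimal sketch of the kind of commands you can type at the scala> prompt (the sample numbers and column names are made up for illustration):

// sc (SparkContext) and spark (SparkSession) are pre-defined in spark-shell.

// A simple RDD: distribute a local range and sum it.
val nums = sc.parallelize(1 to 100)
nums.sum()                                   // 5050.0

// A simple DataFrame built from a local sequence of tuples.
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "label")
df.show()
df.filter("id > 1").count()                  // 2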

spark-shell is a special Spark client. We can check the current configuration by printing the SparkConf object returned by sc.getConf.
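
For example (a small sketch; these are standard SparkConf methods and well-known property keys):

// Inspect the SparkConf held by the pre-defined SparkContext.
sc.getConf.toDebugString             // multi-line dump of the explicitly set properties
sc.getConf.get("spark.master")       // e.g. local[*]
sc.getConf.get("spark.app.name")     // e.g. Spark shell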

We can see that spark.master is set to local[*], which means Spark runs a single executor with as many worker threads as there are CPU cores on the local machine. This setting determines how many tasks can be executed in parallel.
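
You can confirm this from inside the shell; to pin the thread count instead, spark-shell can be started with --master local[2], for example. A minimal check:

// Check the master URL and the parallelism it implies.
sc.master                  // local[*]
sc.defaultParallelism      // in local[*] mode, usually the number of CPU cores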

Each time a SparkContext object is initialized, Spark starts a web UI, which is accessible at http://<ip-address>:4040. We can get details of the environment (as well as jobs, stages, and storage) there.
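
If you are unsure which address the UI is listening on, the SparkContext can also report it (a small sketch; the actual value depends on your host):

// The web UI address chosen by this SparkContext, if the UI is enabled.
sc.uiWebUrl                // e.g. Some(http://<hostname>:4040)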
