Install Spark on Ubuntu (1): Local Mode
This post shows how to set up Spark in local mode on a single machine, without any cluster manager (YARN or Mesos). The goal is to get Spark running quickly for trying things out.
1. Prepare a VM
Start a brand new VM with a cloud service provider, such as AWS, Linode, DigitalOcean, etc. There are several reasons to use a VM. First, Spark is a big data platform that will eventually be set up in the cloud, and a VM in the cloud can be extended to a fully functional cluster in the future. Second, you don't want to mess up your local machine; the VM can be destroyed after you are done with the practice. My preference for this post is DigitalOcean, as it is very easy to start with. AWS EC2 is not a bad choice, but the number of concepts (security groups, EFS, EBS, etc.) may overwhelm new learners.
As root, create a user and grant it sudo permission.
$ adduser spark
$ usermod -aG sudo spark
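To confirm the new account is in the sudo group (an optional sanity check):
$ groups spark   # "sudo" should appear in the output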
2. Install Spark
Download the latest release of Spark from https://spark.apache.org/downloads.html.
Switch to the spark user and download the archive:
$ su spark
$ wget http://apache.claz.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
Extract the archive.
$ tar -xvf spark-2.4.0-bin-hadoop2.7.tgz
Move the Spark folder to /usr/local/ and create a symbolic link so that you can install and switch to another version in the future.
$ sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/
$ sudo ln -s /usr/local/spark-2.4.0-bin-hadoop2.7/ /usr/local/spark
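You can check the symlink with ls, and later repoint it when installing a newer release (the version below is a hypothetical placeholder):
$ ls -l /usr/local/spark
$ sudo ln -sfn /usr/local/spark-x.y.z-bin-hadoopN.N /usr/local/spark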
Define the SPARK_HOME environment variable. Add the following lines to the end of the .profile file:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
And source the .profile file.
$ . ~/.profile
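To verify that the variables are picked up (optional):
$ echo $SPARK_HOME
$ which spark-shell
The first should print /usr/local/spark and the second a path under /usr/local/spark/bin.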
3. Install Java
Spark runs on the JVM, so a Java installation is required; Spark 2.4 works with Java 8.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt update
$ sudo apt install oracle-java8-installer
Add the following line to the end of the .profile file:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
And source .profile again.
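If the webupd8team PPA or the Oracle installer is no longer available, OpenJDK 8 also works for Spark 2.4 (a sketch; the JAVA_HOME path below is the usual location on 64-bit Ubuntu and may differ on your system):
$ sudo apt update
$ sudo apt install openjdk-8-jdk
In that case, point JAVA_HOME in .profile to the OpenJDK directory instead:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Either way, java -version should report a 1.8.x runtime afterwards.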
4. Start Spark Shell
Since we have added /usr/local/spark/bin to PATH, spark-shell can be started by running:
$ spark-shell
You can now play with Spark RDDs, DataFrames, etc.
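For example, a quick sanity check inside the shell could look like this (a minimal sketch; the sc and spark objects are created automatically by spark-shell):
scala> val rdd = sc.parallelize(1 to 100)     // distribute a local collection as an RDD
scala> rdd.filter(_ % 2 == 0).count()         // count the even numbers; returns 50
scala> val df = spark.range(5).toDF("n")      // a tiny single-column DataFrame
scala> df.show()                              // print it as a table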
spark-shell is a special client. We can check the SparkContext configuration by printing the SparkConf object returned by sc.getConf.
We can see that spark.master is set to local[*], which means Spark runs a single executor with a number of worker threads equal to the number of CPU cores available on the local machine. This setting determines how many tasks can be executed in parallel.
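The same information is available directly in the shell (a sketch; the exact values depend on your machine):
scala> sc.getConf.getAll.foreach(println)   // print all explicitly set configuration entries
scala> sc.master                            // "local[*]" in this setup
scala> sc.defaultParallelism                // typically the number of CPU cores in local mode
To cap the parallelism, the master can be overridden at startup, for example spark-shell --master local[2] uses two threads.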
Each time a SparkContext object is initialized, Spark starts a web UI that is accessible at http://<ip-address>:4040, where we can get details of the environment.
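Since the VM is remote, port 4040 may not be reachable from your browser. One way around this is an SSH tunnel (a sketch, assuming you log in as the spark user and <vm-ip> is the VM's public address):
$ ssh -L 4040:localhost:4040 spark@<vm-ip>
While spark-shell is running on the VM, the UI is then available at http://localhost:4040 on your own machine.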