Install Hadoop and HBase on Ubuntu

HBase is an open-source, distributed, non-relational database written in Java. It has become one of the dominant databases in big data. It is developed as part of the Apache Software Foundation's Hadoop project and runs on top of HDFS.

This post demonstrates how to set up Hadoop and HBase on a single machine. Running HDFS, YARN and HBase on one machine is a quick way to try out Hadoop. I will install everything on a brand-new virtual machine, so that every step is covered and can be reproduced. There are a number of ways to build a Hadoop cluster, such as using the Apache tarballs, using Linux packages (e.g. RPM and Debian packages), or using Hadoop cluster management tools (e.g. Apache Ambari). This post shows how to install from the Apache tarballs, which is a good way to understand how Hadoop is configured.

1. Prepare a Virtual Machine Environment

To have a brand new Linux box, I created a VM on digitalocean.com. The Linux version used is Ubuntu 16.04.4 x64 and the VM has 4 GB of memory.

It is recommended to run Hadoop as a dedicated user instead of as root. Create a group 'hadoop' and a user 'hduser', and grant the user sudo privileges.

$ addgroup hadoop
$ adduser --ingroup hadoop hduser
$ usermod -aG sudo hduser

Hadoop requires SSH access to manage its nodes. Therefore, we need to configure SSH access to localhost for the hduser user we created.

$ su hduser
$ ssh-keygen -t rsa -P ""

The second command creates an RSA key pair with an empty passphrase, so we are not prompted for it every time Hadoop interacts with its nodes.

The following commands enable SSH access to the local machine with this newly created key.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

You can test the password-less login with:

$ ssh localhost
$ exit

2. Install Java

Install Oracle's JDK version "1.8.0_181".

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Verify the installation of Java.

$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

3. Install Hadoop

Hadoop releases are listed at http://hadoop.apache.org/releases.html. Version 3.0.3 is used in this tutorial.

$ cd ~/
$ wget http://apache.claz.org/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
$ tar xzvf hadoop-3.0.3.tar.gz
$ sudo mv hadoop-3.0.3 /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop

Edit /home/hduser/.profile to include the required environment variables. You may find the Java installation directory using this command:

$ ls -al /etc/alternatives/java
/etc/alternatives/java -> /usr/lib/jvm/java-8-oracle/jre/bin/java

Add the following lines to the .profile file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin

Edit hadoop-env.sh (under /usr/local/hadoop/etc/hadoop/).

The only environment variable that has to be configured for Hadoop in this tutorial is JAVA_HOME. Open hadoop-env.sh in the editor of your choice and add the following line to set the JAVA_HOME environment variable to the Oracle JDK 8 directory.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Test the installation with the example MapReduce job. In this example, we count the occurrences of strings matching the regular expression "file[.]*".

$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep ~/input ~/count_example 'file[.]*'
$ cat ~/count_example/*
12	file
10	file.

The HADOOP_HOME and PATH exports were already added to .profile above, so simply source the file to make them take effect in the current shell.

$ . ~/.profile
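
To confirm that the Hadoop binaries are now on the PATH, you can run hadoop version; the first line of its output should report the release:

$ hadoop version
Hadoop 3.0.3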

Edit core-site.xml (under /usr/local/hadoop/etc/hadoop).

The core-site.xml file contains settings common to the whole Hadoop instance, most importantly the default file system URI. Here we point the default file system at an HDFS NameNode on localhost, port 9000.

<configuration>
  <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Edit hdfs-site.xml (under /usr/local/hadoop/etc/hadoop).

The hdfs-site.xml file contains HDFS-specific settings, such as the replication factor and the local file system paths where the NameNode and DataNode store their data.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
 
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hduser/hadoopinfra/hdfs/namenode</value>
   </property>
 
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hduser/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
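
Create the local directories referenced above before formatting HDFS; the paths simply match the values of dfs.name.dir and dfs.data.dir in hdfs-site.xml.

$ mkdir -p ~/hadoopinfra/hdfs/namenode
$ mkdir -p ~/hadoopinfra/hdfs/datanode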

Edit yarn-site.xml (under /usr/local/hadoop/etc/hadoop). The yarn-site.xml file configures the YARN daemons: the ResourceManager, the web application proxy server, and the NodeManagers. Here we enable the MapReduce shuffle service on the NodeManager.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

Edit mapred-site.xml (under /usr/local/hadoop/etc/hadoop).

This file specifies which framework MapReduce jobs are submitted to; here we point it at YARN. Hadoop 3 ships a default mapred-site.xml (older 2.x releases ship it as mapred-site.xml.template, which you would copy to mapred-site.xml first).

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
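
One caveat I have seen with Hadoop 3 when running MapReduce jobs on YARN: the application master may fail to find the MapReduce classes unless the classpath is spelled out. If you hit that, adding a property along the following lines to mapred-site.xml is the usual fix; the value is just this tutorial's HADOOP_HOME expanded, and the property should be treated as an optional extra rather than a required part of this walkthrough.

   <property>
      <name>mapreduce.application.classpath</name>
      <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
   </property>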

Format the HDFS filesystem.

The first step of starting up Hadoop is formatting the Hadoop file system, which is implemented on top of the local file systems of the cluster (only the local machine in this tutorial). This only needs to be done the first time you set up a Hadoop cluster. Do not format a running Hadoop file system, as you will lose all the data currently in the cluster! Note that data nodes are not involved in the formatting process, because the name node manages the metadata of the file system and data nodes can join and leave the cluster on the fly. Formatting the file system initializes the directory specified by the dfs.name.dir variable in hdfs-site.xml:

$ hdfs namenode -format

The output ends with:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu-s-2vcpu-4gb-sfo2-01/127.0.1.1
************************************************************/

Start HDFS.

$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ubuntu-s-2vcpu-4gb-sfo2-01]
ubuntu-s-2vcpu-4gb-sfo2-01: Warning: Permanently added 'ubuntu-s-2vcpu-4gb-sfo2-01' (ECDSA) to the list of known hosts.

Start YARN.

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
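
You can confirm that the daemons came up with jps (a tool that ships with the JDK); the list should include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.

$ jps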

Check HDFS health in the NameNode web UI at http://your_ip_address:9870/dfshealth.html.

Check the YARN cluster overview in the ResourceManager web UI at http://your_ip_address:8088/cluster/cluster.

Once you have the Hadoop cluster up and running, you can access HDFS. For example, list the root directory:

$ hadoop fs -ls /

Create a directory "test" under the root directory:

$ hadoop fs -mkdir /test
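
As a quick sanity check, copy a local file into the new directory and list it; the file used here is just one of the configuration files copied into ~/input earlier.

$ hadoop fs -put ~/input/core-site.xml /test
$ hadoop fs -ls /test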

4. Install HBase

HBase releases are available at http://www-us.apache.org/dist/hbase/. Version 1.2.6.1 is used in this tutorial.

$ cd ~/
$ wget http://www-us.apache.org/dist/hbase/stable/hbase-1.2.6.1-bin.tar.gz
$ tar xzvf hbase-1.2.6.1-bin.tar.gz
$ sudo mv hbase-1.2.6.1 /usr/local/hbase
$ sudo chown -R hduser:hadoop /usr/local/hbase

Add the following lines to ~/.profile and source it again.

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin

Set JAVA_HOME in the shell script hbase-env.sh under /usr/local/hbase/conf/.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Edit hbase-site.xml under /usr/local/hbase/conf/. Note that hbase.rootdir must point at the HDFS NameNode address configured in core-site.xml, which is hdfs://localhost:9000 in this tutorial.

<configuration>
   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
   </property>
 
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hduser/zookeeper</value>
   </property>
 
   <property>
     <name>hbase.cluster.distributed</name>
     <value>true</value>
   </property>
</configuration>

Now start HBase, and check the master status at http://your_ip_address:16010/master-status.

$ start-hbase.sh 
starting master, logging to /usr/local/hbase/logs/hbase-hduser-master-ubuntu-s-2vcpu-4gb-sfo2-01.out

You can now play with HBase by using the HBase shell:

hduser@ubuntu-s-2vcpu-4gb-sfo2-01:/usr/local/hbase/bin$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.6.1, rUnknown, Sun Jun  3 23:19:26 CDT 2018

The following commands show how to create a table "test" with one column family "info", add two cells to a row, and read them back.

hbase(main):002:0> create 'test', 'info'
0 row(s) in 2.3660 seconds

=> Hbase::Table - test
hbase(main):003:0> put 'test', 'row1', 'info:name','programcreek'
0 row(s) in 0.3760 seconds

hbase(main):004:0> put 'test', 'row1', 'info:type','blog'
0 row(s) in 0.0210 seconds

hbase(main):005:0> get 'test', 'row1'
COLUMN                       CELL                                                                              
 info:name                   timestamp=1533271202442, value=programcreek                                       
 info:type                   timestamp=1533271213622, value=blog                                               
2 row(s) in 0.1030 seconds

hbase(main):007:0> scan 'test'
ROW                          COLUMN+CELL                                                                       
 row1                        column=info:name, timestamp=1533271202442, value=programcreek                     
 row1                        column=info:type, timestamp=1533271213622, value=blog                             
1 row(s) in 0.0540 seconds

If you want to delete the table, you must first disable it before dropping it:

hbase(main):001:0> disable 'test'
0 row(s) in 3.1120 seconds

hbase(main):002:0> drop 'test'
0 row(s) in 1.2890 seconds

hbase(main):003:0> list
TABLE                                                                                                          
SYSTEM.CATALOG                                                                                                 
SYSTEM.FUNCTION                                                                                                
SYSTEM.LOG                                                                                                     
SYSTEM.MUTEX                                                                                                   
SYSTEM.SEQUENCE                                                                                                
SYSTEM.STATS                                                                                                   
user                                                                                                           
7 row(s) in 0.0310 seconds

=> ["SYSTEM.CATALOG", "SYSTEM.FUNCTION", "SYSTEM.LOG", "SYSTEM.MUTEX", "SYSTEM.SEQUENCE", "SYSTEM.STATS", "user"]
hbase(main):004:0> 

If you are interested, you may also check out some Java code examples showing how to use HBase from Java.
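
For reference, here is a minimal sketch of what such a Java client can look like against the HBase 1.2 client API. It assumes the HBase client library and hbase-site.xml are on the classpath, and that a table "test" with column family "info" exists (re-create it in the shell if you dropped it above); the class name and row key are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper on localhost in this setup).
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("test"))) {

            // Write one cell: row "row2", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row2"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("programcreek"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("row2")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("info:name = " + Bytes.toString(value));
        }
    }
}

After running it, get 'test', 'row2' in the HBase shell should show the new cell.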
