cassandra-spark-analytics

Supercharge your analysis of Cassandra data with Apache Spark

Introduction

Apache Cassandra is a fantastic scalable, fault-tolerant NoSQL database; however, it is notoriously hard to query your data outside of what your data model allows. Apache Spark is a fast and general engine for large-scale data processing.

In this guide I intend to demonstrate the power of Spark to analyse your Cassandra data in ways that are impractical with CQL alone. This guide is aimed at anyone who is interested in data analysis.

Requirements

This guide has been tested with the software versions outlined below. Older versions may work, but test at your own risk.

This guide was developed using CentOS 7.

Software                   | Version
---------------------------+--------
Cassandra                  | 2.2.5
cqlsh                      | 4.1.1
Cassandra Spark Connector  | 1.5
Spark                      | 1.6.1
Docker                     | 1.11.1
docker-compose             | 1.7.1
Python                     | 2.7.5

Setup

Throughout this guide it is assumed that your working directory is the root of this repository; adjust paths accordingly if yours differs.

Docker:

Firstly we need Docker and docker-compose installed and running. A good getting-started guide is available in the official Docker documentation.

Check that Docker is functioning correctly by running docker ps.

Cassandra:

Start Cassandra container

We will start Cassandra with the included compose/cassandra.yml instruction file. This file can be edited if you know what you are doing, but the defaults are sane for this demo.

The Cassandra Thrift port (9160) will be bound to 127.0.0.1:9160, and we will use it to connect to Cassandra via cqlsh from the host.
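That binding corresponds to a ports entry along the following lines. This is only a hypothetical excerpt; the shipped compose/cassandra.yml remains the authoritative version.

# hypothetical excerpt from compose/cassandra.yml
cassandra:
  ports:
    - "127.0.0.1:9160:9160"   # host IP:host port:container port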

docker-compose -f compose/cassandra.yml up -d

Check the container has started with docker ps.

Install Cqlsh

Install cqlsh with pip install --user cqlsh==4.1.1.

Test that you can access Cassandra correctly: cqlsh 127.0.0.1

Create schema

Create the Cassandra schema with:

cqlsh 127.0.0.1 -f schema/spark_demo.cql
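The actual schema ships with the repository in schema/spark_demo.cql. For orientation, a minimal equivalent matching the columns used in this demo would look roughly like this (the column types and keyspace settings are assumptions):

CREATE KEYSPACE IF NOT EXISTS spark_demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS spark_demo.person_data (
  id         int PRIMARY KEY,
  first_name text,
  last_name  text,
  email      text,
  gender     text,
  ip_address text
);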

Import the test dataset (adjust the relative path to match your working directory):

echo "use spark_demo; COPY person_data (id, first_name, last_name, email, gender, ip_address) FROM 'schema/spark_demo_data.csv' WITH HEADER=true;" | cqlsh 127.0.0.1

If it is successful you should see output like: 1000 rows imported in 0.477 seconds.

I generated the demo data using Mockaroo. You can spot-check the imported rows with:

echo "SELECT * FROM spark_demo.person_data;" | cqlsh 127.0.0.1

 id  | email                  | first_name | gender | ip_address    | last_name
-----+------------------------+------------+--------+---------------+-----------
 769 | esanchezlc@comcast.net |     Ernest |   Male | 165.66.44.126 |   Sanchez

...

Spark

We will start Spark with the included compose/spark.yml instruction file. You will need to edit this file to set the relative path to cassandra-spark-analytics/scripts; this directory will be mounted into the running Spark container to give Spark access to the processing scripts.
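The part to adjust is the volume mount. A hypothetical excerpt is shown below, with the scripts directory mounted at /jobs inside the container (the path used by spark-submit later in this guide); the exact layout of the shipped compose/spark.yml may differ.

# hypothetical excerpt from compose/spark.yml
spark:
  volumes:
    - /path/to/cassandra-spark-analytics/scripts:/jobs   # host path : container path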

The Spark instructions are located in docker/spark/Dockerfile if you want to roll your own image.

Start Spark container

docker-compose -f compose/spark.yml up -d

Check the container has started with docker ps.

Execute a Spark job

For this demo we are going to group and count all the first names in our person_data table. The job script, count.py, lives in cassandra-spark-analytics/scripts. You will need to replace CASSANDRA_DOCKER_IP with the IP address of the running Cassandra container:

conf.set("spark.cassandra.connection.host", "CASSANDRA_DOCKER_IP")

To get the Cassandra container's IP address run:

docker inspect --format '{{ .NetworkSettings.IPAddress }}' cassandra

Once the IP address is updated you can execute the Spark job:

docker exec -ti spark /spark/bin/spark-submit /jobs/count.py

When the script finishes executing you will be left with output like the following:

+----------+-----+
|first_name|count|
+----------+-----+
|    Brenda|   12|
|    Sharon|   12|
|   Deborah|   12|
|     Kelly|   11|
|     Diana|   10|
|    Carlos|   10|
|     Irene|    9|
|   Heather|    9|
|    Louise|    9|
|    Andrea|    9|
|   Jeffrey|    9|
|   Michael|    9|
|    Victor|    9|
|     Terry|    8|
|     Steve|    8|
|    Ernest|    8|
|     Larry|    8|
|      Lois|    8|
|      John|    8|
|   Richard|    8|
|     Louis|    8|
|     Ralph|    8|
|    Joseph|    8|
|      Mary|    8|
|     Billy|    8|
+----------+-----+
only showing top 25 rows

That concludes this demo. If you come up with anything cool, submit a PR!