Twitter sentiment analysis with Spark MLlib and visualization

Introduction

This project analyses and visualizes the sentiment of tweets in real time on a world map using the Apache Spark ecosystem (Spark MLlib and Spark Streaming).

At a very high level, this project encapsulates and covers each of the following broad topics:

For more details on this project and its code, please see the blog post.
A Docker image with the complete environment and dependencies installed and preconfigured is also available on Docker Hub.

Note:

I had written a blog post on my personal website with a code walkthrough explaining the intricate details, but unfortunately I managed to corrupt my Octopress GitHub repo. :anguished: :weary: :rage: So, until I salvage it, I am publishing it as a GitHub wiki.

Visualization Demo and screenshots

Demo of visualization

[Image: demo of visualization]

Screenshots of visualization

Overview

[Image: overview]

Positive sentiment

[Image: positive sentiment]

Neutral sentiment

[Image: neutral sentiment]

Negative sentiment

[Image: negative sentiment]

Features

Docker Image and Dockerfile

Dependencies

The following is the complete list of languages and frameworks used, with the role each plays in this project.

  1. OpenJDK 64-Bit v1.8.0_102 » Java for compiling and execution; the VM to be precise
  2. Scala v2.10.6 » basic infrastructure and Spark jobs
  3. SBT v0.13.12 » build script and uber jar creation
  4. Apache Spark v1.6.2
    • Spark Streaming » connecting to Twitter and streaming the tweets
    • Spark MLlib » creating an ML model and predicting the sentiment of tweets based on the text
    • Spark SQL » saving tweets [both raw and classified]
  5. Stanford CoreNLP v3.6.0 » alternative approach to find sentiment of tweets based on the text
  6. Redis » publishing classified tweets; the front-end app subscribes to them to render the chart
  7. Datamaps » chart and visualization
  8. Python » running the Flask app that renders the front-end
  9. Flask » render the template for front-end

Please also check build.sbt for the project's other dependencies.
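
To make the Redis hand-off in item 6 more concrete, here is a minimal Python sketch of how a classified tweet could be serialized, published and consumed. The channel name `classified_tweets`, the payload fields (`text`, `sentiment`, `lat`, `lon`) and all helper names are illustrative assumptions, not the project's actual code; in the project itself the publishing side runs inside the Spark job. The two functions that touch the network assume the `redis` Python package and a Redis server on localhost.

```python
import json

# Hypothetical channel name; the project's real channel may differ.
CHANNEL = "classified_tweets"


def encode_tweet(text, sentiment, lat, lon):
    """Serialize one classified tweet as a JSON string (assumed payload shape)."""
    if sentiment not in ("positive", "neutral", "negative"):
        raise ValueError("unknown sentiment: %r" % (sentiment,))
    return json.dumps(
        {"text": text, "sentiment": sentiment, "lat": lat, "lon": lon}
    )


def decode_tweet(message):
    """Parse a payload produced by encode_tweet back into a dict."""
    return json.loads(message)


def publish_tweet(payload, host="localhost", port=6379):
    """Publish one payload; returns how many subscribers received it."""
    import redis  # imported lazily so the helpers above work without it

    client = redis.Redis(host=host, port=port)
    return client.publish(CHANNEL, payload)


def listen_for_tweets(host="localhost", port=6379):
    """Yield decoded tweets as they arrive on the channel (blocks while waiting)."""
    import redis

    pubsub = redis.Redis(host=host, port=port).pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        # listen() also yields subscribe confirmations; skip those.
        if message["type"] == "message":
            yield decode_tweet(message["data"])
```

In a setup like this, the front-end would iterate over `listen_for_tweets()` and forward each tweet to the browser, where the map is coloured by the `sentiment` field.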

Prerequisites for successful execution

Env Setup

If not already installed, please install Docker on your machine.

We will be using the accompanying Docker image created for this project.

Resources for the Docker machine

Execution

Run the Docker image

docker run -ti -p 4040:4040 -p 8080:8080 -p 8081:8081 -p 9999:9999 -h spark --name=spark p7hb/p7hb-docker-mllib-twitter-sentiment:1.6.2

Please note:

Twitter App OAuth credentials

Execute Spark Streaming job for sentiment prediction

Visualization app

Further work and improvement areas

Expert mode execution steps

This is a quick recap of the steps required to run this code.
Please follow these steps only if you are an expert on Docker, Spark and the rest of this project's ecosystem, and clearly understand what each step does.

Note:

Please do not forget to modify the Twitter App OAuth credentials in the file application.conf.
Please check the Twitter Developer page for more info.
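
As a guard against forgetting to update the credentials, a small pre-flight check can help. The Python sketch below scans the text of a config file for the four values that Twitter's OAuth 1.0a flow requires and reports any that are missing or empty. The key names (`CONSUMER_KEY` and so on) and the flat `KEY = "value"` layout are assumptions for illustration only; check the project's actual application.conf for the real names.

```python
import re

# Hypothetical key names; the real application.conf may use different ones.
REQUIRED_KEYS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN_KEY",
    "ACCESS_TOKEN_SECRET",
)

# Matches lines such as: CONSUMER_KEY = "abc123"  (either = or : as separator)
_LINE = re.compile(r'^\s*([A-Za-z_][\w.]*)\s*[=:]\s*"([^"]*)"\s*$')


def missing_credentials(conf_text):
    """Return the required keys that are absent or set to an empty string."""
    found = {}
    for line in conf_text.splitlines():
        match = _LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2).strip()
    return [key for key in REQUIRED_KEYS if not found.get(key)]
```

Reading application.conf into a string and passing it to `missing_credentials` before starting the Spark job gives an early, readable failure instead of an authentication error from Twitter.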

Helpful links

  1. I hosted this web app on Amazon EC2 at http://54.84.252.184:9999/. Update on 19th September, 2016: after running the live app on EC2 for almost a month, I shut down the instance that day.
  2. Docker Image on Docker Hub Registry: https://hub.docker.com/r/p7hb/p7hb-docker-mllib-twitter-sentiment/.
  3. GitHub URL for source code of the project: https://github.com/P7h/Spark-MLlib-Twitter-Sentiment-Analysis.
  4. GitHub URL for blog post on code walkthrough: https://github.com/P7h/Spark-MLlib-Twitter-Sentiment-Analysis/wiki/.
  5. Dockerfile GitHub repo: https://github.com/P7h/p7hb-docker-mllib-twitter-sentiment.

Problems? Questions? Contributions?

If you find any issues or would like to discuss further, please ping me on my Twitter handle @P7h or drop me an email. Appreciate your help. Thanks!

License

Copyright © 2016 Prashanth Babu.
Licensed under the Apache License, Version 2.0.