Exsto: Developer Community Insights

A frequently asked question on the Apache Spark user email list concerns where to find data sets for evaluating the code. Oddly enough, the collection of archived messages for this email list provides an excellent data set for evaluating Spark capabilities, e.g., machine learning, graph algorithms, text analytics, time-series analysis, etc.

Herein, an open source developer community considers itself algorithmically. This project is a work in progress that shows how to surface data insights from the developer email forums of an Apache open source project. It uses natural language processing, machine learning, graph algorithms, time-series analysis, etc. As an example, we use data from the <user@spark.apache.org> email list archives to help understand its community better.

See these talks about Exsto:

DataDayTexas 2015 session talk, [Microservices, Containers, and Machine Learning](http://www.slideshare.net/pacoid/microservices-containers-and-machine-learning)
Scala Days EU 2015 session, [GraphX: Graph analytics for insights about developer communities](http://www.slideshare.net/pacoid/graphx-graph-analytics-for-insights-about-developer-communities)

In particular, we show production use of NLP tooling in Python, integrated with MLlib (machine learning) and GraphX (graph algorithms) in Apache Spark. Machine learning approaches used include Word2Vec, TextRank, Connected Components, and Streaming K-Means, among others.
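
To give a feel for one of the approaches named above, here is a minimal TextRank-style sketch in plain Python. It is not the project's implementation: the function names `textrank_scores` and `top_keywords`, the window size, and the damping factor are all assumptions chosen for illustration, and the usual part-of-speech filtering and stopword removal are left out to keep it short.

```python
import re
from collections import defaultdict


def textrank_scores(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens with PageRank over an undirected co-occurrence graph.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Link every pair of distinct tokens that fall inside the same
    # sliding window of `window` consecutive tokens.
    neighbors = defaultdict(set)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                neighbors[left].add(right)
                neighbors[right].add(left)

    if not neighbors:
        return {}

    # Start from a uniform distribution and repeat the PageRank update.
    n = len(neighbors)
    scores = {word: 1.0 / n for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1.0 - damping) / n
            + damping * sum(scores[other] / len(neighbors[other]) for other in linked)
            for word, linked in neighbors.items()
        }
    return scores


def top_keywords(text, k=3):
    """Return the k highest-scoring lowercase word tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = textrank_scores(tokens)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Words that co-occur with many different neighbors accumulate the highest scores, which is why a term that recurs across an email thread tends to surface as a keyword.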

Keep in mind that "One Size Fits All" is an anti-pattern, especially for Big Data tools. This project illustrates how to leverage microservices and containers to scale out the code+data components that do not fit well in Spark, Hadoop, etc.

In addition to Spark, other technologies used include: Mesos, Docker, Anaconda, Flask, NLTK, TextBlob.

Dependencies

https://github.com/opentable/docker-anaconda

conda config --add channels https://conda.binstar.org/sloria
conda install textblob
python -m textblob.download_corpora
python -m nltk.downloader -d ~/nltk_data all
pip install -U textblob textblob-aptagger
pip install lxml
pip install python-dateutil
pip install Flask

NLTK and TextBlob require some data downloads, which may in turn require updating the NLTK data path:

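import os
import nltk
# expand "~" explicitly so the entry points at the actual home directory
nltk.data.path.append(os.path.expanduser("~/nltk_data/"))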

Running

To change the project configuration, simply edit the defaults.cfg file.
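
As a rough sketch of how a script might read such a file, the snippet below assumes that defaults.cfg is INI-style. The helper name `load_settings` and the `scraper` section name are hypothetical, not taken from the project; check the actual file for its real sections and keys.

```python
import configparser


def load_settings(path, section="scraper"):
    """Read one section of an INI-style file into a plain dict.

    The default section name is a hypothetical placeholder.
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files, so check the result.
    if not parser.read(path):
        raise FileNotFoundError(path)
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section in {path}")
    # Values come back as strings; convert them where needed.
    return dict(parser[section])
```

A call such as `load_settings("defaults.cfg")` would then return the keys of that section as a dictionary of strings.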

Scrape the email list via scripts:

./scrape.py data/foo.json

Parse the email text via scripts:

./parse.py data/foo.json parsed/foo.json
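
Once a parsed file exists, a quick sanity check is to tally messages per sender. The sketch below makes two assumptions that should be checked against the real output: that the file holds one JSON object per line, and that each object carries a `sender` field. The function name `count_by_sender` is also just illustrative.

```python
import json
from collections import Counter


def count_by_sender(lines, field="sender"):
    """Count messages per sender from an iterable of JSON-encoded lines.

    Assumes one JSON object per line; `field` is a hypothetical key name.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        # Records without the field are grouped under "unknown".
        counts[message.get(field, "unknown")] += 1
    return counts
```

Passing an open file handle for parsed/foo.json to `count_by_sender` and calling `.most_common(5)` on the result would list the five most active senders under those assumptions.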

Public Data

Example data from the Apache Spark email list is available as JSON:

What's in a name?

The word exsto is the Latin verb meaning "to stand out", in its present active form.