A frequently asked question on the Apache Spark user email list concerns where to find data sets for evaluating the code. Oddly enough, the collection of archived messages for this email list provides an excellent data set for evaluating Spark capabilities: machine learning, graph algorithms, text analytics, time-series analysis, etc.
Herein, an open source developer community considers itself algorithmically.
This project shows work in progress on how to surface data insights from the developer email forums of an Apache open source project. It leverages technologies for natural language processing, machine learning, graph algorithms, and time-series analysis. As an example, we use data from the Apache Spark email list archives to help understand that community better.
In particular, we show production use of NLP tooling in Python, integrated with MLlib (machine learning) and GraphX (graph algorithms) in Apache Spark. Machine learning approaches used include Word2Vec, TextRank, Connected Components, and Streaming K-Means.
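To give a flavor of how these pieces fit together, here is a minimal sketch of training Word2Vec with Spark MLlib on tokenized message text. The input path and the naive whitespace tokenization are assumptions for illustration, not the project's actual pipeline:

```python
# Minimal sketch: train Word2Vec on tokenized email text with Spark MLlib.
# The input path and whitespace tokenization are placeholder assumptions.
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="exsto_word2vec")

# one message body per line; lowercase and split into tokens
corpus = sc.textFile("parsed/foo.json").map(lambda line: line.lower().split())

model = Word2Vec().setVectorSize(100).setMinCount(5).fit(corpus)

# find terms that appear in contexts similar to a query term
for word, cosine_sim in model.findSynonyms("graphx", 10):
    print(word, cosine_sim)

sc.stop()
```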
Keep in mind that "One Size Fits All" is an anti-pattern, especially for Big Data tools. This project illustrates how to leverage microservices and containers to scale out the code+data components that do not fit well in Spark, Hadoop, etc.
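For instance, an NLP step can run as a small web service alongside Spark, packaged in its own container. The sketch below wraps TextBlob parsing in a Flask endpoint; the `/parse` route, port, and response fields are hypothetical choices for illustration:

```python
# Minimal sketch: expose a TextBlob parsing step as a Flask microservice.
# The /parse route, port, and response schema are illustrative assumptions.
from flask import Flask, request, jsonify
from textblob import TextBlob

app = Flask(__name__)

@app.route("/parse", methods=["POST"])
def parse():
    text = request.get_json(force=True).get("text", "")
    blob = TextBlob(text)
    return jsonify({
        "sentences": [str(s) for s in blob.sentences],
        "noun_phrases": list(blob.noun_phrases),
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client could then POST message text as JSON, e.g. `{"text": "..."}`, to `http://localhost:5000/parse`.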
To install the dependencies:

```bash
conda config --add channels https://conda.binstar.org/sloria
conda install textblob
python -m textblob.download_corpora
python -m nltk.downloader -d ~/nltk_data all
pip install -U textblob textblob-aptagger
pip install lxml
pip install python-dateutil
pip install Flask
```
NLTK and TextBlob require some data downloads, which may also require updating the NLTK data path:

```python
import nltk
import os

# expand "~" explicitly so the path resolves correctly
nltk.data.path.append(os.path.expanduser("~/nltk_data/"))
```
To change the project configuration, simply edit the project's configuration file.
To run the message parser:

```bash
./parse.py data/foo.json parsed/foo.json
```
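To inspect the parsed output in Python, a minimal sketch, assuming `parse.py` writes one JSON object per line (an assumption about the output format):

```python
import json

# Load parsed messages, assuming one JSON object per line of output.
with open("parsed/foo.json") as f:
    messages = [json.loads(line) for line in f]

print(len(messages), "messages parsed")
```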
Example data from the Apache Spark email list is available as JSON in the `data/` directory.
The word *exsto* is the Latin verb meaning "to stand out", in its present active form.