This code goes with my Datanami article.
It illustrates MLlib on Spark using an example based on collaborative filtering of employee ratings for companies.
It shows the exact same Spark client functionality written in Java 7 and Java 8. The new Java 8 features make Spark's functional style much easier to use.
Cassandra provides the data to Spark, and there's a synthesized training/validation set with an accompanying spreadsheet to let you tweak parameters.
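As a rough sketch of the Java 7 versus Java 8 difference (not code from this repository), the same mapping function can be written as an anonymous inner class or as a lambda. The class `MapStyles`, the helper `map`, and the rating-normalization logic are all hypothetical. Spark's Java API has its own function interfaces, and `java.util.function.Function` stands in for them here so the example compiles without Spark on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not code from this repository.
class MapStyles {

    // Tiny stand-in for a Spark-style map(): applies fn to every element.
    static <T, R> List<R> map(List<T> input, Function<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T item : input) {
            out.add(fn.apply(item));
        }
        return out;
    }

    // Java 7 style: the function is passed as an anonymous inner class.
    static List<Double> normalizeJava7(List<Integer> ratings) {
        return map(ratings, new Function<Integer, Double>() {
            @Override
            public Double apply(Integer rating) {
                return rating / 5.0; // scale an assumed 1-5 rating to (0, 1]
            }
        });
    }

    // Java 8 style: the same function written as a lambda.
    static List<Double> normalizeJava8(List<Integer> ratings) {
        return map(ratings, rating -> rating / 5.0);
    }

    public static void main(String[] args) {
        List<Integer> ratings = Arrays.asList(1, 3, 5);
        System.out.println(normalizeJava7(ratings));
        System.out.println(normalizeJava8(ratings));
    }
}
```

Both methods return the same list; the lambda version just drops the boilerplate around the one line that matters.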
Here's how to get it working:
To set up (tested on Ubuntu 14.04):
sudo apt-get install oracle-java8-installer
~/dev
)./bin/run-example SparkPi
Get Eclipse:
Project
pom.xml, choose Maven -> install.
Dataset
ratings.csv is generated from ratings.ods, which is a spreadsheet for synthesizing data sets to test and fine-tune your model. Open ratings.ods and save it as CSV. See readme.txt in the data directory for instructions.
Cassandra
sudo /usr/bin/cassandra
cqlsh -f ./collabfilter/src/sql/collab_filter_schema.sql
Running tests:
Run collabfilter.CollabFilterCassandraDriver.main or the CollabFilterTest unit test.
More references:
pom.xml specifies Guava 15. The Spark-Cassandra connector uses Guava 14, but Spark expects Guava 15 or above, which includes additional methods.
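For reference, pinning Guava in a Maven pom.xml usually looks something like the fragment below. The exact version string (15.0) and where the entry sits are assumptions; check this project's pom.xml for the real entry.

```xml
<!-- Assumed form of the Guava pin; the project's pom.xml is authoritative. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>15.0</version>
</dependency>
```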