The primary change in V2 of the tool is simplification: all external processes are integrated so that one line will execute all steps.
The steps to execute the full workflow are:
pio status
Make sure PredictionIO and all required services are running with no errors; pio status will do this. Then set up the split and tests in the config.json file.
map_test split ...
This will split the data into "train" and "test" sets according to the splitting directive in config.json. It is often desirable to use an existing data split, in which case the split step can be omitted.
pio build
This will build the Universal Recommender code and register the algorithm parameters.
pio train
This will create a model with the UR from the engine.json parameters. There are parameters in engine.json that are passed to Spark in the training process and that are system and data dependent, so make sure train completes correctly before moving on. Using these tools usually happens after the bootstrap dataset has been successfully trained, so the split is made on that dataset.
pio deploy
This will create a running PIO PredictionServer that responds to UR queries based on the training split of the dataset (see the query sketch after this list).
map_test test ...
This will run the MAP@k tests against the deployed model.
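Once pio deploy is running, a quick way to confirm the PredictionServer answers UR queries is the predictionio Python SDK (installed in the requirements below). A minimal sketch, assuming the default PredictionServer port 8000 and a made-up user id:

import predictionio

# Assumes the server started by pio deploy listens on the default port 8000.
# "u-1" is a hypothetical user id; use one that exists in your training split.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "u-1", "num": 10}))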
Install Spark, PredictionIO v0.11.0 or greater, and the Universal Recommender v0.7.3 or greater. Make sure pio status
completes with no errors and the integration-test for the UR runs correctly.
Install Python and check the version:
python3 --version
If the version is less than 3.x, upgrade to the most recent stable version of Python 3 using your system's package management tools, like apt-get for Ubuntu Linux or brew for macOS. This tool has been tested minimally with Python 3 and does require it. Leave an issue if you find one.
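For example, the upgrade with the package managers mentioned above (the exact package name may differ on your system):

sudo apt-get install python3   # Ubuntu Linux
brew install python3           # macOS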
Install the Python libraries using pip3, the Python package manager. Note that if you have python3 running you may already have pip3; which pip3 will check. If you don't have it installed, install pip3 for python3 first.
sudo pip3 install numpy scipy pandas ml_metrics predictionio tqdm click openpyxl
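To verify the libraries installed correctly, a quick check (not part of the tool itself):

python3 -c "import numpy, scipy, pandas, ml_metrics, predictionio, tqdm, click, openpyxl"

If this prints nothing, everything imported cleanly.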
Set up the Spark and PySpark paths in .bashrc (Linux) or .bash_profile (macOS):
export SPARK_HOME=/path/to/spark
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/build/:$PYTHONPATH
The analysis script should be run from the UR (Universal Recommender) folder. It uses two configuration files:
engine.json (configuration of the UR; this file is used to get the event list and the primary event)
config.json (all other configuration, including the engine.json path if necessary)
config.json has the following structure:
{
"engine_config": "./engine.json",
"splitting": {
"version": "1",
"source_file": "hdfs:...<PUT SOME PATH>...",
"train_file": "hdfs:...<PUT SOME PATH>...train",
"test_file": "hdfs:...<PUT SOME PATH>...test",
"type": "date",
"train_ratio": 0.8,
"random_seed": 29750,
"split_event": "<SOME NAME>"
},
"reporting": {
"file": "./report.xlsx"
},
"testing": {
"map_k": 10,
"non_zero_users_file": "./non_zero_users.dat",
"consider_non_zero_scores_only": true,
"custom_combos": {
"event_groups": [["ev2", "ev3"], ["ev6", "ev8", "ev9"]]
}
},
"spark": {
"master": "spark://<some-url>:7077"
}
}
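The map_k value is the k used in MAP@k. As a rough illustration of the metric the tool reports, here is a minimal sketch using the ml_metrics package from the requirements above (the user and item lists are invented):

from ml_metrics import mapk

# actual: the held-out "test" items per user
# predicted: the ranked recommendations returned by the UR for the same users
actual = [["i1", "i3"], ["i2"]]
predicted = [["i3", "i5", "i1"], ["i7", "i2"]]
print(mapk(actual, predicted, k=10))  # mean average precision at k, averaged over users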
Get data from the EventServer with:
pio export --appid <your app id> --output path/to/store/events
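This writes a directory of part files with one JSON event per line. A typical UR event looks roughly like this (event name and ids are made up):

{"event":"purchase","entityType":"user","entityId":"u-1","targetEntityType":"item","targetEntityId":"i-1","eventTime":"2017-01-01T00:00:00.000Z"}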
Use this command to split the data into "train" and "test" sets:
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split
Additional options are available:
--csv_report
- put the report in a CSV file, not Excel
--intersections
- calculate train / test event intersection data (advanced)
The above command will create a test and training split in the location specified in config.json. Now you must import the "train" split, set up engine.json, then train and deploy the model so the rest of the MAP@k tests will be able to query it (see the sketch below).
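A sketch of those steps, assuming the train_file path from config.json and your PredictionIO app id (adjust to your setup):

pio import --appid <your app id> --input <train_file path from config.json>
pio build
pio train
pio deploy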
To run the tests:
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py test --all
Additional options are available and may be used to run only some of the tests:
--csv_report
- put the report in a CSV file, not Excel
--dummy_test
- run the dummy test
--separate_test
- run a test for each separate event
--all_but_test
- run a test with all events, plus tests with all but each individual event
--primary_pairs_test
- run tests with all pairs of events that include the primary event
--custom_combos_test
- run the custom combo tests as configured in config.json
--non_zero_users_from_file
- use the list of users from the file prepared on a previous script run, to save time
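For example, to run only the per-event tests with a CSV report, or only the primary-event pair tests (assuming the same SPARK_HOME and PYTHONPATH settings as the full command above, and that flags combine as listed):

./map_test.py test --separate_test --csv_report
./map_test.py test --primary_pairs_test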
Todo
The old approach of running an IPython notebook is not recommended:
IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --master spark://spark-url:7077