A flexible (but opinionated) toolkit for doing and sharing reproducible data science.
EasyData started life as an experimental fork of cookiecutter-data-science where we could try out ideas before proposing them as fixes to the upstream branch. It has grown into its own toolkit for implementing a reproducible data science workflow, and is the basis of our Bus Number tutorial on Reproducible Data Science.
For a tutorial on making use of this framework, visit: https://github.com/hackalog/bus_number/
Requirements:
anaconda (or miniconda)
python3.6+ (we use f-strings. So should you)
Cookiecutter Python package >= 1.4.0: this can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
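The version requirements above (Python 3.6+ and cookiecutter >= 1.4.0) can be checked with a short script before generating a project. This is a minimal sketch, not part of EasyData or cookiecutter: the helper names `parse_version`, `meets_minimum`, and `installed_version` are made up for illustration, and the lookup relies on `importlib.metadata`, which only exists on Python 3.8 and later (on older interpreters the sketch simply reports that no version was detected).

```python
import sys


def parse_version(text):
    """Turn a dotted version string such as '1.4.0' into a tuple of ints.

    Parsing stops at the first component that does not start with a digit,
    so a suffix like 'rc1' is ignored. (Hypothetical helper.)
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '1.4' compares equal to '1.4.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def installed_version(package):
    """Return the installed version string of a package, or None."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.6+:", sys.version_info[:2] >= (3, 6))
    found = installed_version("cookiecutter")
    if found is None:
        print("cookiecutter: version not detected")
    else:
        print("cookiecutter >= 1.4.0:", meets_minimum(found, "1.4.0"))
```

Versions are compared numerically rather than as strings, so that 1.10.0 correctly counts as newer than 1.4.0.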
To start a new project, run:
$ cookiecutter https://github.com/hackalog/cookiecutter-easydata
The directory structure of your new project looks like this:
LICENSE
Makefile
    Run make for a list of valid commands.
README.md
catalog
data
    data/raw
    data/interim
    data/processed
docs
models
    models/trained
    models/output
notebooks
    Notebook names carry a "-" delimited description, e.g. 1.0-jqp-initial-data-exploration.
references
reports
    reports/figures
    reports/tables
    reports/summary
environment.yml
setup.py
    Turns MODULE_NAME into a pip-installable python module (pip install -e .) so it can be imported in python code.
MODULE_NAME
    MODULE_NAME/__init__.py
    MODULE_NAME/data
        MODULE_NAME/data/make_dataset.py
            Run as python -m MODULE_NAME.data.make_dataset fetch or python -m MODULE_NAME.data.make_dataset process.
    MODULE_NAME/analysis
    MODULE_NAME/models
        predict_model.py, train_model.py
tox.ini
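To confirm that a freshly generated project matches the layout listed above, a small check along these lines may help. It is a sketch under assumptions: `EXPECTED_DIRS`, `EXPECTED_FILES`, and `missing_paths` are illustrative names that are not shipped with EasyData, and the lists cover only a subset of the paths shown above (the MODULE_NAME package is left out because its name differs per project).

```python
from pathlib import Path

# A subset of the directories listed above (illustrative, not exhaustive).
EXPECTED_DIRS = [
    "catalog",
    "data/raw",
    "data/interim",
    "data/processed",
    "docs",
    "models/trained",
    "models/output",
    "notebooks",
    "reports/figures",
    "reports/tables",
    "reports/summary",
]

# Top-level files listed above.
EXPECTED_FILES = [
    "LICENSE",
    "Makefile",
    "README.md",
    "environment.yml",
    "setup.py",
    "tox.ini",
]


def missing_paths(project_root):
    """Return the expected paths that are absent under project_root.

    Directories are reported first, then files, each in list order.
    """
    root = Path(project_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Check the current working directory.
    for path in missing_paths("."):
        print("missing:", path)
```

Run from the project root, it prints one line per missing path and nothing when everything is present. Because `is_dir()` follows symbolic links, a data directory that is symlinked to another filesystem still counts as present.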
The first time:
make create_environment
git init
git add .
git commit -m "initial import"
git branch easydata # tag for future easydata upgrades
Subsequent updates:
make update_environment
In case you need to delete the environment later:
conda deactivate
make delete_environment