discomll

Disco Machine Learning Library (discomll) is a Python package for machine learning with the MapReduce paradigm. It works with the Disco framework for distributed computing. discomll is suited to the analysis of large datasets and offers classification, regression and clustering algorithms.
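To illustrate the MapReduce paradigm mentioned above, here is a minimal pure-Python sketch that computes per-class feature means, the kind of statistic a distributed classifier might need. It does not use the discomll or Disco APIs; the function names (`map_chunk`, `reduce_partials`, `class_means`) and the toy data are hypothetical and chosen only for illustration.

```python
from functools import reduce


def map_chunk(rows):
    """Map phase: summarise one chunk as {label: (count, per-feature sums)}."""
    partial = {}
    for features, label in rows:
        count, sums = partial.get(label, (0, [0.0] * len(features)))
        partial[label] = (count + 1, [s + x for s, x in zip(sums, features)])
    return partial


def reduce_partials(left, right):
    """Reduce phase: merge two partial summaries by adding counts and sums."""
    merged = dict(left)
    for label, (count, sums) in right.items():
        if label in merged:
            old_count, old_sums = merged[label]
            merged[label] = (old_count + count,
                             [a + b for a, b in zip(old_sums, sums)])
        else:
            merged[label] = (count, sums)
    return merged


def class_means(chunks):
    """Combine the map and reduce phases into per-class feature means."""
    # In a distributed setting each chunk would be mapped on a separate worker;
    # here the map phase simply runs sequentially.
    partials = [map_chunk(chunk) for chunk in chunks]
    totals = reduce(reduce_partials, partials, {})
    return {label: [s / count for s in sums]
            for label, (count, sums) in totals.items()}


# Toy data (hypothetical): two chunks of (features, label) pairs.
chunks = [
    [([1.0, 2.0], "a"), ([3.0, 4.0], "b")],
    [([3.0, 6.0], "a"), ([5.0, 8.0], "b")],
]
means = class_means(chunks)
print(means)  # {'a': [2.0, 4.0], 'b': [4.0, 6.0]}
```

Because each chunk is summarised independently and the summaries are merged with an associative operation, the map phase can be spread across any number of workers without changing the result.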

Algorithms

Classification algorithms

Clustering algorithms

Regression algorithms

Utilities

Features of discomll

discomll works with the following data sources:

discomll enables multiple settings for a dataset:

Installing

Prerequisites

pip install discomll

Performance analysis

In the performance analysis, we compare the speed and accuracy of discomll algorithms with scikit-learn and KNIME. We measure the speedups of discomll algorithms with 1, 3, 6 and 9 Disco workers.
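As a sketch of how such speedups are usually computed, the snippet below divides the single-worker runtime by the runtime with N workers, and derives parallel efficiency as speedup divided by N. The helper names and the timings are hypothetical placeholders, not measured discomll results.

```python
def speedups(timings):
    """Return {workers: T(1) / T(workers)} from {workers: runtime in seconds}."""
    baseline = timings[1]  # runtime with a single worker
    return {workers: baseline / seconds
            for workers, seconds in sorted(timings.items())}


def efficiencies(timings):
    """Return {workers: speedup / workers}; 1.0 means perfect scaling."""
    return {workers: s / workers for workers, s in speedups(timings).items()}


# Placeholder runtimes in seconds (illustrative only, NOT real measurements).
placeholder_timings = {1: 120.0, 3: 48.0, 6: 30.0, 9: 24.0}

print(speedups(placeholder_timings))  # {1: 1.0, 3: 2.5, 6: 4.0, 9: 5.0}
```

A speedup close to the number of workers indicates near-linear scaling; efficiency falling as workers are added typically reflects communication and scheduling overhead.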

Performance analysis 2

In the second performance analysis, we compare the accuracy of distributed ensemble algorithms with that of scikit-learn algorithms. We train the distributed algorithms on the whole dataset and the single-core algorithms on a subset. We show that distributed ensembles achieve accuracy similar to that of the single-core algorithms.

Try it now

You can try discomll algorithms on the ClowdFlows platform. ClowdFlows is an open-source, cloud-based platform for the composition, execution, and sharing of interactive machine learning and data mining workflows. For instructions, see the User Guide.

Public workflows:

Release notes

version 0.1.4.2 (Released 18/oct/2015)

version 0.1.4.1 (Released 17/oct/2015)

version 0.1.4 (Released 11/oct/2015)