xcluster contains algorithms and evaluation tools for extreme clustering, i.e., instances of clustering in which the number of points to be clustered and the number of clusters is large. Most notably, xcluster contains an implementation of PERCH (Purity Enhancing Rotations for Cluster Hierachies). PERCH is an online extreme clustering algorithm that incrementally builds a tree with data points at its leaves. During the data point insertion procedure, PERCH performs rotations to keep the tree accurate and as balanced as possible. Empirical experiments show that PERCH produces purer trees faster than other algorithms; theoretical analysis shows that for separable data, PERCH builds trees with perfect dendrogram purity regardless of the order of the data. Technical details and analysis of the algorithm can be found in our paper: A Hierarchical Algorithm for Extreme Clustering
If running the python code, download and Install Anaconda's Python3
https://docs.continuum.io/anaconda/install
If running python code, install numba
conda install numba
Set environment variables:
source bin/setup.sh
Install maven if you don't already have it installed:
./bin/util/install_mvn.sh
Build Scala code:
./bin/build.sh
Download data
./bin/download_data.sh
Run Test on Separated Data:
./bin/test/test_perch_dendrogram_purity.sh
Run PERCH on Small Scale Data (glass dataset):
# Hierarchical clustering
./bin/hierarchical/glass/run_perch.sh
# Flat clustering
./bin/flat/glass/run_perch.sh
Run PERCH on ALOI (see notes below for suggested system environment):
# Hierarchical clustering
./bin/hierarchical/aloi/run_perch.sh
# Flat clustering
./bin/flat/aloi/run_perch.sh
Run Test on Separated Data:
./bin/test/test_perch_dendrogram_purity_py.sh
Run PERCH on Small Scale Data (glass dataset):
# Hierarchical clustering
./bin/hierarchical/glass/run_perch_py.sh
Run PERCH on ALOI:
# Hierarchical clustering
./bin/hierarchical/aloi/run_perch_py.sh