

A Java-based set of classification and clustering algorithms. Built under JDK 1.7 and tested up to JDK 1.8. This project is under ongoing development and should not be used in production environments.


clust4j 1.1.6 prerelease is now available under releases. To build the bleeding edge, use the included gradle wrapper:

git clone
cd clust4j
./gradlew build ## Go grab some coffee... this takes a minute
## jar is at build/libs/clust4j-x.x.x.jar and includes dependencies

clust4j includes several built-in datasets. Get started quickly by loading one of them:


DataSet data;
data = ExampleDataSets.loadIris();
data = ExampleDataSets.loadWine();
data = ExampleDataSets.loadBreastCancer();

Simple example data (to use for reproducibility):

final int k = 2;
final Array2DRowRealMatrix mat = new Array2DRowRealMatrix(new double[][] {
    new double[] {0.005,     0.182751,  0.1284},
    new double[] {3.65816,   0.29518,   2.123316},
    new double[] {4.1234,    0.2301,    1.8900002}
});
Currently implemented algorithms include KMeans, KMedoids, NearestCentroid, and HDBSCAN, among others (see the examples below).

Evaluating performance

All clustering algorithms that implement Classifier can be scored. Supervised and unsupervised methods are scored in different manners. To score a fitted NearestCentroid model (a supervised classifier):

// implicitly uses ground truth and predicted labels
double accuracy = nc.score();

Conversely, if we'd like to score an unsupervised algorithm, we have a few options. Every UnsupervisedClassifier implements a silhouetteScore method as well as a home-grown scoring algorithm called INDEX_AFFINITY, which addresses the following complication:

// our k-means model predicts labels like this:
km.getLabels(); // {0,1,1}
// ...but the ground truth may actually be {9,15,15}

This method, then, measures accuracy not in the traditional sense, but by comparing the predicted segmentation against the actual label segmentation: it penalizes indices that are associated with incorrect neighbor indices. By this measure, a ground truth of {0,1,0,2,2} and a predicted set of {2,0,2,1,1} are 100% accurate. Use it as follows:

km.indexAffinityScore(new int[]{50,10,10}); // 100%
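The idea behind INDEX_AFFINITY can be sketched in plain Java. This is an illustrative approximation, not clust4j's actual implementation: a prediction scores 100% when it is a consistent one-to-one relabeling of the ground truth, and each index whose label breaks that one-to-one mapping is penalized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the index-affinity idea (not clust4j source):
// build a bidirectional mapping between predicted and true labels as we
// scan, and count the indices that remain consistent with it.
public class IndexAffinitySketch {
    public static double affinity(int[] truth, int[] predicted) {
        Map<Integer, Integer> predToTruth = new HashMap<>();
        Map<Integer, Integer> truthToPred = new HashMap<>();
        int consistent = 0;
        for (int i = 0; i < truth.length; i++) {
            Integer t = predToTruth.putIfAbsent(predicted[i], truth[i]);
            Integer p = truthToPred.putIfAbsent(truth[i], predicted[i]);
            // consistent iff this index agrees with the mapping seen so far
            if ((t == null || t == truth[i]) && (p == null || p == predicted[i]))
                consistent++;
        }
        return (double) consistent / truth.length;
    }

    public static void main(String[] args) {
        // {2,0,2,1,1} is a consistent relabeling of {0,1,0,2,2}
        System.out.println(affinity(new int[]{0,1,0,2,2}, new int[]{2,0,2,1,1})); // prints 1.0
    }
}
```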

Alternatively, the silhouetteScore method can be called as follows:

km.silhouetteScore(); // Uses internally predicted labels
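For reference, the silhouette score itself is a standard metric and can be sketched from scratch. This is a minimal illustration of the formula, not clust4j's code: for each point, a is the mean intra-cluster distance, b is the mean distance to the nearest other cluster, and s = (b - a) / max(a, b), averaged over all points.

```java
// Minimal silhouette-score sketch (standard formula; illustrative only).
public class SilhouetteSketch {
    static double dist(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    public static double silhouette(double[][] X, int[] labels) {
        int n = X.length;
        double total = 0;
        for (int i = 0; i < n; i++) {
            // per-cluster {sum of distances, count} from point i
            java.util.Map<Integer, double[]> sums = new java.util.HashMap<>();
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double[] sc = sums.computeIfAbsent(labels[j], k -> new double[2]);
                sc[0] += dist(X[i], X[j]);
                sc[1] += 1;
            }
            double a = 0, b = Double.MAX_VALUE;
            for (java.util.Map.Entry<Integer, double[]> e : sums.entrySet()) {
                double mean = e.getValue()[0] / e.getValue()[1];
                if (e.getKey() == labels[i]) a = mean;       // own cluster
                else b = Math.min(b, mean);                  // nearest other cluster
            }
            total += (b - a) / Math.max(a, b);
        }
        return total / n;
    }
}
```

A good clustering of two well-separated blobs scores near 1; a labeling that splits each blob across clusters goes negative.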

Making predictions

Any algorithm implementing BaseClassifier or its sub-interfaces will implement the method predict(AbstractRealMatrix). After the model has been fit, predictions can be made as follows (using KMeans for example):

KMeans model = new KMeansParameters(2).fitNewModel(mat); // fit the model on the training data
Array2DRowRealMatrix newData = new Array2DRowRealMatrix(new double[][]{ // here's the new data
    new double[]{0.0, 0.0, 0.0},
    new double[]{3.0, 0.0, 2.0}
});
int[] predicted_labels = model.predict(newData); // returns {0, 1}
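For a centroid-based model like KMeans, what predict does is conceptually simple, and can be sketched without the library (this is the assumed behavior, not clust4j source): each new row gets the label of its nearest fitted centroid.

```java
// Illustrative sketch of centroid-based prediction: assign each new row
// to the index of the closest centroid (squared Euclidean distance).
public class NearestCentroidPredict {
    public static int[] predict(double[][] centroids, double[][] newData) {
        int[] labels = new int[newData.length];
        for (int i = 0; i < newData.length; i++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int j = 0; j < centroids[c].length; j++) {
                    double diff = newData[i][j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < best) { best = d; labels[i] = c; }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // two centroids roughly matching the example matrix above
        double[][] centroids = {{0, 0, 0}, {4, 0, 2}};
        double[][] newData = {{0.1, 0, 0}, {3, 0, 2}};
        int[] labels = predict(centroids, newData); // {0, 1}
        System.out.println(labels[0] + ", " + labels[1]);
    }
}
```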

Note that HDBSCAN does not currently implement a functional predict method, and will throw an UnsupportedOperationException.

Putting it all together with the TrainTestSplit

clust4j includes a TrainTestSplit class that will shuffle and split a DataSet given a training ratio. This is useful for both SupervisedClassifiers and UnsupervisedClassifiers (especially when leveraging clust4j's own INDEX_AFFINITY scoring metric). Here's an example of how this can be used with KMeans:

// Load data, get splits
TrainTestSplit split = new TrainTestSplit(ExampleDataSets.loadIris(), 0.70);
DataSet train = split.getTrain(), test = split.getTest();

// Fit a model and get predictions
KMeans model = new KMeansParameters(3).fitNewModel(train.getData());
int[] predicted_labels = model.predict(test.getData());

// Get the "affinity" score using the Index Affinity algorithm:
double affinity = ClassificationScoring.INDEX_AFFINITY.evaluate(test.getLabels(), predicted_labels);
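The mechanics of a shuffle-and-split utility are easy to sketch in isolation. This is an illustrative stand-in (clust4j's TrainTestSplit operates on its DataSet class, not raw indices): shuffle the row indices, then cut at the training ratio.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of a train/test split over row indices (illustrative, not
// clust4j source): shuffle, then cut at ceil(n * trainRatio).
public class TrainTestSplitSketch {
    public static int[][] split(int n, double trainRatio, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // seeded for reproducibility
        int cut = (int) Math.ceil(n * trainRatio);
        int[] train = new int[cut], test = new int[n - cut];
        for (int i = 0; i < n; i++) {
            if (i < cut) train[i] = idx.get(i);
            else test[i - cut] = idx.get(i);
        }
        return new int[][]{train, test};
    }
}
```

A 0.70 ratio on the 150-row iris set yields 105 training rows and 45 holdout rows, with every index appearing exactly once across the two partitions.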

Put your model into a Pipeline:

clust4j includes a Pipeline object to feed data through a series of ordered PreProcessors and finally, into a clustering model (see below, in the utilities section for more info). Here's an example:

// Let's predict whether a tumor is malignant or benign:
TrainTestSplit split = new TrainTestSplit(ExampleDataSets.loadBreastCancer(), 0.7);

// Get the splits
DataSet training = split.getTrain();
DataSet holdout  = split.getTest();

// Initialize pipe
SupervisedPipeline<NearestCentroid> pipeline = new SupervisedPipeline<NearestCentroid>(
    // This is the parameter class for the model that will be fit
    new NearestCentroidParameters()
        .setMetric(new GaussianKernel()),

    // These are the preprocessors that will transform the training data:
    new StandardScaler(), // first transformation
    new MinMaxScaler(),   // second transformation
    new PCA(0.85)         // final transformation
);

// Fit the pipeline:
pipeline.fit(training.getData(), training.getLabels());

// Make predictions:
int[] predicted_labels = pipeline.predict(holdout.getData());
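The key point about a pipeline is that preprocessors run in the order given, and order matters. A small self-contained sketch (standard z-score and min-max formulas, not clust4j's StandardScaler/MinMaxScaler classes) makes this concrete for a single column:

```java
// Chaining two scalers by hand: z-score a column, then squash it to
// [0, 1]. A Pipeline automates exactly this kind of ordered chaining.
public class PipelineSketch {
    public static double[] standardScale(double[] col) {
        double mean = 0, sd = 0;
        for (double v : col) mean += v;
        mean /= col.length;
        for (double v : col) sd += (v - mean) * (v - mean);
        sd = Math.sqrt(sd / col.length);
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) out[i] = (col[i] - mean) / sd;
        return out;
    }

    public static double[] minMaxScale(double[] col) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : col) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) out[i] = (col[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        double[] col = {1, 2, 3, 4};
        // apply the steps in order, as a Pipeline would
        double[] piped = minMaxScale(standardScale(col));
        System.out.println(java.util.Arrays.toString(piped));
    }
}
```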

Separability metrics

A number of separability metrics are available for use:

Kernel functions:

  - ANOVA kernel
  - Cauchy kernel
  - Circular kernel
  - Exponential kernel
  - Gaussian kernel
  - Sigmoid kernel
  - Laplacian kernel
  - Linear kernel
  - Log kernel
  - Quadric kernels
  - Polynomial kernel
  - Power kernel
  - Radial basis kernel
  - Rational quadratic kernel
  - Spherical kernel
  - Spline kernel

Note the distinction between similarity-based and distance-based GeometricallySeparable metrics. All clustering algorithms can handle any metric implementing the GeometricallySeparable interface, and all of them minimize distance. A metric implementing SimilarityMetric is therefore implicitly maximized: classes implementing SimilarityMetric should define getDistance(double[], double[]) to return -getSimilarity(double[], double[]).
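The convention above can be shown in a few lines. The interface names below mirror the ones mentioned in the text but are re-declared locally for the sketch; the Gaussian kernel formula is the standard one:

```java
// Sketch of the distance = -similarity convention: minimizing distance
// then maximizes similarity, so similarity metrics plug into the same
// machinery as distance metrics.
public class SimilarityConvention {
    interface GeometricallySeparable { double getDistance(double[] a, double[] b); }
    interface SimilarityMetric extends GeometricallySeparable {
        double getSimilarity(double[] a, double[] b);
    }

    static class GaussianKernel implements SimilarityMetric {
        private final double sigma;
        GaussianKernel(double sigma) { this.sigma = sigma; }

        public double getSimilarity(double[] a, double[] b) {
            double sq = 0;
            for (int i = 0; i < a.length; i++) sq += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.exp(-sq / (2 * sigma * sigma)); // standard Gaussian kernel
        }

        public double getDistance(double[] a, double[] b) {
            return -getSimilarity(a, b); // the convention described above
        }
    }

    public static void main(String[] args) {
        GaussianKernel k = new GaussianKernel(1.0);
        double[] a = {0, 0}, b = {1, 0};
        // identical points are maximally similar, hence minimally "distant"
        System.out.println(k.getSimilarity(a, a) + " " + k.getDistance(a, b));
    }
}
```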

When to use similarity metrics over distance metrics?

Various similarity metrics—kernel methods, in particular—allow the clustering algorithm to segment the data in Hilbert Space[4], which can—assuming the proper kernel is selected—allow the algorithm to identify "complex," or non-(hyper)spherically shaped clusters:


Whereas a distance metric in Euclidean space may struggle with oddly shaped clusters[9]:


To initialize any clusterer with a kernel as the GeometricallySeparable metric (example uses GaussianKernel):

final Kernel kernel = new GaussianKernel();
KMedoids km = new KMedoidsParameters(k).setMetric(kernel).fitNewModel(mat);

Note: though similarity metrics may be used with any clustering algorithm, it is recommended that they not be used with density-based clustering algorithms, as they seek "neighborhoods" around points and similarity metrics such as kernels will not accurately describe a point's neighborhood. Using a similarity metric with a density-based algorithm will cause a warning to be logged and the algorithm will fall back to the default separability metric (Euclidean distance).


References:

  1. Souza C.R., Kernel Functions for Machine Learning
  2. Yu K., Ji L., Zhang X., Kernel Nearest-Neighbor Algorithm
  3. Ester M., Kriegel H.-P., Sander J.S., Xu X., 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Institute for Computer Science, University of Munich
  4. Chitta, R., Kernel Clustering
  5. kernlab R package
  6. h2o (for log wrapper structure)
  7. Divisive Clustering
  8. sklearn clustering repository
  9. The Drawbacks of k-Means
  10. hdbscan python implementation
  11. HDBSCAN research paper