pu4spark

A library for Positive-Unlabeled Learning for Apache Spark MLlib (ml package)

Implemented algorithms

Traditional PU

Original Positive-Unlabeled learning algorithm; firstly proposed in

Liu, B., Dai, Y., Li, X. L., Lee, W. S., & Philip, Y. (2002). Partially supervised classification of text documents. In ICML 2002, Proceedings of the nineteenth international conference on machine learning. (pp. 387–394).

Gradual Reduction PU (aka PU-LEA)

Modified Positive-Unlabeled learning algorithm; main idea is to gradually refine set of positive examples. Pseudocode was taken from:

Fusilier, D. H., Montes-y-Gómez, M., Rosso, P., & Cabrera, R. G. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443.

Requirements

Spark 1.5+

(Spark 2+ was not tested, but should work if replace SparkContext by SparkSession and mllib.linalg.Vector by ml.linalg.Vector)

Linking

The library is published into Maven central and JCenter. Add the following lines depending on your build system.

Gradle

compile 'ru.ispras:pu4spark:0.3'

Maven

<dependency>
    <groupId>ru.ispras</groupId>
    <artifactId>pu4spark</artifactId>
    <version>0.3</version>
</dependency>

SBT

libraryDependencies += "ru.ispras" % "pu4spark" % "0.3"

Building from Sources

Build library with gradle:

./gradlew jar

Usage example

val inputLabelName = "category"
val srcFeaturesName = "srcFeatures"
val outputLabel = "outputLabel"

val puLearnerConfig = TraditionalPULearnerConfig(0.05, 1, LogisticRegressionConfig())
val puLearner = puLearnerConfig.build()
val df = ... //needed df that contains at least the following columns:
// binary label for positive and unlabel (inputLabelName)
// and features assembled as vector (featuresName)

val weightedDF = puLearner.weight(preparedDf, inputLabelName, srcFeaturesName, outputLabel)

Returned dataframe contains probability estimation for each instance in the column outputLabel.

Features can be assembled to one column by using VectorAssembler:

val assembler = new VectorAssembler()
  .setInputCols(df.columns.filter(c => c != rowName)) //keep here only feature columns
  .setOutputCol(featuresName)
val pipeline = new Pipeline().setStages(Array(assembler))
val preparedDf = pipeline.fit(df).transform(df)