PySpark2PMML

Python library for converting Apache Spark ML pipelines to PMML.

Features

This package provides Python wrapper classes and functions for the JPMML-SparkML library. For the full list of supported Apache Spark ML Estimator and Transformer types, refer to the JPMML-SparkML documentation.

Prerequisites

Installation

Install a release version from PyPI:

pip install pyspark2pmml

Alternatively, install the latest snapshot version from GitHub:

pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git

Configuration and usage

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

Apache Spark version | JPMML-SparkML branch | JPMML-SparkML uber-JAR file
---------------------|----------------------|----------------------------
2.0.X                | 1.1.X (Archived)     | 1.1.23
2.1.X                | 1.2.X (Archived)     | 1.2.15
2.2.X                | 1.3.X (Archived)     | 1.3.15
2.3.X                | 1.4.X                | 1.4.15
2.4.X                | 1.5.X                | 1.5.8
3.0.X                | master               | 1.6.0
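
The compatibility matrix can also be expressed as a small lookup, which may be handy in setup scripts. The following is a minimal sketch, not part of PySpark2PMML: the `COMPATIBILITY` dictionary and the `jpmml_sparkml_for` function are hypothetical names, and the values are copied from the matrix above.

```python
# Compatibility matrix from this README:
# Apache Spark "major.minor" -> (JPMML-SparkML branch, uber-JAR version).
COMPATIBILITY = {
    "2.0": ("1.1.X", "1.1.23"),
    "2.1": ("1.2.X", "1.2.15"),
    "2.2": ("1.3.X", "1.3.15"),
    "2.3": ("1.4.X", "1.4.15"),
    "2.4": ("1.5.X", "1.5.8"),
    "3.0": ("master", "1.6.0"),
}


def jpmml_sparkml_for(spark_version):
    """Return (branch, uber-JAR version) for a Spark version such as "2.4.5"."""
    # Only the major and minor components matter for the lookup.
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return COMPATIBILITY[major_minor]
    except KeyError:
        raise ValueError(
            "No JPMML-SparkML release listed for Apache Spark " + spark_version
        )


print(jpmml_sparkml_for("2.4.5"))  # ('1.5.X', '1.5.8')
```

In a running PySpark session, the version string could be taken from `spark.version`.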

Launch PySpark; use the --jars command-line option to specify the location of the JPMML-SparkML uber-JAR file:

pyspark --jars /path/to/jpmml-sparkml-executable-${version}.jar
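
When running a standalone Python script instead of the PySpark shell, the uber-JAR file can probably be supplied through the `spark.jars` configuration property. The following is a sketch under that assumption. The `uber_jar_path` and `create_session` helpers and the application name are hypothetical, and the property must be set before the session (and its JVM) is first created.

```python
import os


def uber_jar_path(directory, version):
    """Build the uber-JAR path from the file name pattern used in this README."""
    return os.path.join(
        directory, "jpmml-sparkml-executable-{}.jar".format(version)
    )


def create_session(jar_path):
    """Create a SparkSession that has the uber-JAR on its classpath."""
    # Imported lazily so that the path helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("pyspark2pmml-example")  # hypothetical application name
        .config("spark.jars", jar_path)   # assumed counterpart of --jars
        .getOrCreate()
    )


# Usage in a standalone script (requires PySpark; path and version are examples):
#   spark = create_session(uber_jar_path("/path/to", "1.6.0"))
#   sc = spark.sparkContext
```

The PySpark shell defines the `spark` and `sc` variables automatically. A standalone script has to create them itself, as in the usage comment.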

Fitting an example pipeline model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

# "spark" is the SparkSession that the PySpark shell creates on startup
df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

# Use "Species" as the label and all remaining columns as features
formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML file:

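from pyspark2pmml import PMMLBuilder

# "sc" is the SparkContext that the PySpark shell creates on startup
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
    .putOption(classifier, "compact", True)

# Convert the pipeline model and write the PMML document to a file
pmmlBuilder.buildFile("DecisionTreeIris.pmml")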
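
As a quick sanity check after the export, the generated file can be inspected with the Python standard library. The following is a minimal sketch: `summarize_pmml` is a hypothetical helper, not part of PySpark2PMML, and it only checks that the file parses as XML with a `PMML` root element.

```python
import xml.etree.ElementTree as ET


def summarize_pmml(path):
    """Return the PMML version and the local names of the top-level elements."""
    root = ET.parse(path).getroot()

    def local_name(tag):
        # ElementTree renders namespaced tags as "{namespace-uri}LocalName".
        return tag.rsplit("}", 1)[-1]

    if local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document: " + path)
    return root.get("version"), [local_name(child.tag) for child in root]


# Usage, once the export above has completed:
#   version, elements = summarize_pmml("DecisionTreeIris.pmml")
#   print(version, elements)
```

For the decision tree pipeline above, the element list would be expected to include a `TreeModel` entry, although the exact contents depend on the JPMML-SparkML version in use.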

License

PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use PySpark2PMML in a proprietary software project, you can enter into a licensing agreement that makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io