AMIDST Toolbox




Probabilistic Machine Learning

The AMIDST Toolbox lets you model your problem in a flexible probabilistic language based on graphical models. You then fit the model to data using a Bayesian approach, which accounts for modeling uncertainty.

Multi-core and distributed processing

AMIDST provides tailored parallel (powered by Java 8 Streams) and distributed (powered by Flink or Spark) implementations of Bayesian parameter learning for batch and streaming data. This processing is based on flexible and scalable message passing algorithms.
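
To make the idea of Bayesian parameter learning on streaming data concrete, here is a minimal sketch in plain Java. It is not AMIDST code and does not use the AMIDST API: the class `BetaBernoulliStream` and its methods are hypothetical names chosen for illustration, and a simple conjugate Beta-Bernoulli update stands in for the toolbox's message passing algorithms. It only shows the general principle that the posterior after one batch becomes the prior for the next.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only (not part of AMIDST): Bayesian updating over a stream of batches. */
class BetaBernoulliStream {
    // Parameters of the Beta(alpha, beta) belief over the success probability.
    private double alpha;
    private double beta;

    BetaBernoulliStream(double priorAlpha, double priorBeta) {
        this.alpha = priorAlpha;
        this.beta = priorBeta;
    }

    /** Conjugate update: the current posterior acts as the prior for the new batch. */
    void update(boolean[] batch) {
        for (boolean x : batch) {
            if (x) {
                alpha += 1.0;
            } else {
                beta += 1.0;
            }
        }
    }

    /** Point estimate of the success probability. */
    double posteriorMean() {
        return alpha / (alpha + beta);
    }

    /** Remaining uncertainty about the success probability. */
    double posteriorVariance() {
        double n = alpha + beta;
        return (alpha * beta) / (n * n * (n + 1.0));
    }

    public static void main(String[] args) {
        BetaBernoulliStream learner = new BetaBernoulliStream(1.0, 1.0); // uniform prior
        List<boolean[]> stream = Arrays.asList(
                new boolean[]{true, true, false},
                new boolean[]{true, false, true, true});
        for (boolean[] batch : stream) {
            learner.update(batch);
            System.out.printf("mean=%.3f variance=%.4f%n",
                    learner.posteriorMean(), learner.posteriorVariance());
        }
    }
}
```

The posterior variance shrinks as batches arrive, which is how a Bayesian learner expresses growing confidence in its parameters.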


Simple Code Example

Fitting a model with local data

        //Load the data
        String filename = "./data.arff";
        DataStream<DataInstance> data = DataStreamLoader.open(filename);

        //Learn the model
        Model model = new CustomGaussianMixture(data.getAttributes());


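        model.updateModel(data);

        // Save with .bn format (replace "./" with the full output file path)
        BayesianNetworkWriter.save(model.getModel(), "./");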

Fitting a model with distributed data

        //Load the data
        String filename = "hdfs://dataDistributed.arff";
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

        //Learn the model
        Model model = new CustomGaussianMixture(data.getAttributes());


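        model.updateModel(data);

        // Save with .bn format (replace "./" with the full output file path)
        BayesianNetworkWriter.save(model.getModel(), "./");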

Real-World Use Cases

Risk prediction in credit operations

AMIDST Toolbox has been used to track concept drift and predict risk in credit operations. Because the data is collected continuously and reported daily, this is a streaming data classification problem. This work was carried out in collaboration with one of our partners, the Spanish bank BCC. It is expected to go into production at the beginning of 2017.

Recognition of traffic maneuvers

AMIDST Toolbox has been used to prototype models for early recognition of traffic maneuver intentions. As in the previous case, data is collected continuously by the car's on-board sensors, giving rise to a large and quickly evolving data stream. This work was carried out in collaboration with one of our partners, DAIMLER.



Multi-Core Scalability using Java 8 Streams

Scalability is a main concern for the AMIDST toolbox. Java 8 streams are used to provide parallel implementations of our learning algorithms, so AMIDST users who need more computing capacity can simply use more CPU cores. As an example, the following figure shows how the data processing capacity of our toolbox increases with the number of CPU cores when learning a probabilistic model (including a class variable C, two latent variables (dashed nodes), and multinomial (blue nodes) and Gaussian (green nodes) observable variables) using AMIDST's learning engine. As can be seen, using our variational learning engine, the AMIDST toolbox is able to process data on the order of gigabytes (GB) per hour, depending on the number of available CPU cores, even for large and complex probabilistic graphical models (PGMs) with latent variables. Note that these experiments were carried out on an Ubuntu Linux server with an x86_64 architecture and 32 cores. The size of the processed data set was measured according to Weka's ARFF format.
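
As a rough illustration of how Java 8 streams spread a learning computation over CPU cores, the sketch below computes the sufficient statistics of a Gaussian (count, sum, and sum of squares) with a parallel stream. It is not taken from the AMIDST code base: `ParallelGaussianStats` and `Stats` are hypothetical names, and a plain maximum likelihood fit stands in for the toolbox's variational learning engine. Only standard JDK calls are used.

```java
import java.util.stream.DoubleStream;

/** Illustrative sketch only (not part of AMIDST): parallel sufficient statistics. */
class ParallelGaussianStats {

    /** Count, sum, and sum of squares are enough to recover mean and variance. */
    static final class Stats {
        long n;
        double sum;
        double sumSq;

        // Fold one observation into this partial result.
        void accept(double x) {
            n++;
            sum += x;
            sumSq += x * x;
        }

        // Merge a partial result computed on another chunk of the data.
        void combine(Stats other) {
            n += other.n;
            sum += other.sum;
            sumSq += other.sumSq;
        }

        double mean() {
            return sum / n;
        }

        // Population variance (divides by n).
        double variance() {
            double m = mean();
            return sumSq / n - m * m;
        }
    }

    /** Each worker thread accumulates its own Stats; the stream merges them at the end. */
    static Stats fit(double[] data) {
        return DoubleStream.of(data)
                .parallel()
                .collect(Stats::new, Stats::accept, Stats::combine);
    }

    public static void main(String[] args) {
        Stats stats = fit(new double[]{1.0, 2.0, 3.0, 4.0});
        System.out.println("mean=" + stats.mean() + " variance=" + stats.variance());
    }
}
```

Because the per-chunk results are merged with an associative `combine` step, the same code runs unchanged on one core or many; removing `.parallel()` gives the sequential version.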

Distributed Scalability using Apache Flink

If your data is too big to be stored on a single laptop, you can still learn your probabilistic model from it using the AMIDST distributed learning engine, which is based on a novel, state-of-the-art distributed message passing scheme implemented on top of Apache Flink. As detailed in this paper, we were able to perform inference in a billion-node (i.e. 10^9) probabilistic model on an Amazon cluster with 2, 4, 8 and 16 nodes, each node containing 8 processing units. The following figure shows the scalability of our approach under these settings.

Spark Link Module on AMIDST

This module integrates the functionality of the AMIDST toolbox with the Apache Spark platform.

The following functionality is already implemented on the sparklink module:

More information here

Publications & Use-Cases

The following repository contains the source code and details about the publications and use-cases using the AMIDST toolbox.

Upcoming Developments

The AMIDST toolbox is an expanding project. Upcoming developments include, for instance, the ongoing integration of the toolbox with Spark to extend its scalability. In addition, a new link to R is in progress, which will expand the AMIDST user base.

Contributing to AMIDST

AMIDST is an open source toolbox, and end users are encouraged to upload their contributions (which may include basic contributions, major extensions, and/or use cases) following the instructions given in this link.

Acknowledgements and License

This software was developed as part of the AMIDST project. AMIDST has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 619209.

This software is distributed under the Apache License, Version 2.0.