An open source project from Data to AI Lab at MIT.
Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for tabular data.
A Synthetic Data Generator is a Python function (or class method) that takes some data as input, which we call the real data, learns a model from it, and outputs new synthetic data whose mathematical properties are similar to those of the real data.
Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones included in SDGym and see the current leaderboard.
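As a rough illustration of the definition above, here is a minimal sketch of a synthesizer function. It assumes that the real data arrives as a 2-D numpy array and that the function takes the three arguments used in the usage example of this README. The function name and the toy data are hypothetical, and this is not the implementation of any synthesizer shipped with SDGym.

```python
import numpy as np


def independent_columns_synthesizer(real_data, categorical_columns, ordinal_columns):
    """Hypothetical synthesizer that resamples every column independently.

    Assumes real_data is a 2-D numpy array whose rows are records. The two
    column index lists are accepted to match the expected signature, but they
    are not used: resampling observed values works for any column type.
    """
    n_rows, n_columns = real_data.shape
    synthetic = np.empty_like(real_data)
    for column in range(n_columns):
        # Draw n_rows values, with replacement, from the values observed in this column.
        synthetic[:, column] = np.random.choice(
            real_data[:, column], size=n_rows, replace=True)

    return synthetic


# Toy input (made up): column 0 is continuous, column 1 holds categorical codes.
real = np.array([[0.5, 0.0], [1.5, 1.0], [2.5, 0.0], [3.5, 2.0]])
synthetic = independent_columns_synthesizer(
    real, categorical_columns=[1], ordinal_columns=[])
```

The "model" learned here is just the per-column empirical distribution, so any correlation between columns is lost. A useful synthesizer would have to capture those relationships as well.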
SDGym evaluates the performance of Synthetic Data Generators using datasets that are in three families:
This is a summary of the current SDGym leaderboard, showing the number of datasets in which each Synthesizer obtained the best score.
Detailed leaderboard results for all the releases are available in this Google Docs Spreadsheet.
SDGym has been developed and tested on Python 3.5 and 3.6.
Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDGym is run.
The easiest and recommended way to install SDGym is using pip:
pip install sdgym
This will pull and install the latest stable release from PyPI.
If you want to install it from source or contribute to the project please read the Contributing Guide for more details about how to do it.
All you need to do to use the SDGym benchmark is import the benchmark function and call it, passing it your synthesizer function and the settings that you want to use.
For example, to evaluate a simple synthesizer function on the census dataset, we can execute:
    import numpy as np
    from sdgym import benchmark


    def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
        """dummy synthesizer that just returns a permutation of the real data."""
        return np.random.permutation(real_data)


    scores = benchmark(synthesizers=my_synthesizer_function, datasets=['census'])
The output of the benchmark function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset, as well as the results obtained previously by the SDGym synthesizers:
                             adult/accuracy  adult/f1  ...  ring/test_likelihood
    IndependentSynthesizer          0.56530  0.134593  ...             -1.958888
    UniformSynthesizer              0.39695  0.273753  ...             -2.519416
    IdentitySynthesizer             0.82440  0.659250  ...             -1.705487
    ...                                 ...       ...  ...                   ...
    my_synthesizer_function         0.64865  0.210103  ...             -1.964966
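Since the result is a regular pandas DataFrame, it can be explored with the usual pandas tools. The sketch below rebuilds a small frame from the values shown in the sample output (elided rows and columns are left out) instead of calling SDGym, and it assumes that higher values are better for the metrics shown.

```python
import pandas as pd

# Values copied from the sample output; elided rows and columns are omitted.
scores = pd.DataFrame(
    {
        'adult/accuracy': [0.56530, 0.39695, 0.82440, 0.64865],
        'adult/f1': [0.134593, 0.273753, 0.659250, 0.210103],
        'ring/test_likelihood': [-1.958888, -2.519416, -1.705487, -1.964966],
    },
    index=[
        'IndependentSynthesizer',
        'UniformSynthesizer',
        'IdentitySynthesizer',
        'my_synthesizer_function',
    ],
)

# Keep only the metrics of one dataset, using the "<dataset>/<metric>" prefix.
adult_scores = scores.filter(like='adult/')

# Rank the synthesizers by a single metric, best first
# (assumption: higher is better for this metric).
ranking = scores['adult/accuracy'].sort_values(ascending=False)

# Name of the top-scoring synthesizer for every column.
best_per_column = scores.idxmax()
```

Note that IdentitySynthesizer comes out on top in every column here. Judging by its name, it returns the real data itself, so its scores are probably best read as a reference point rather than as a competing result.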
If you want to run the SDGym benchmark on the SDGym Synthesizers, you can pass the corresponding class, or a list of classes, directly to the benchmark function.
For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (this will take a lot of time to run!):
    from sdgym import benchmark
    from sdgym.synthesizers import (
        CLBNSynthesizer, CTGANSynthesizer, IdentitySynthesizer, IndependentSynthesizer,
        MedganSynthesizer, PrivBNSynthesizer, TableganSynthesizer, TVAESynthesizer,
        UniformSynthesizer, VEEGANSynthesizer)

    all_synthesizers = [
        CLBNSynthesizer,
        IdentitySynthesizer,
        IndependentSynthesizer,
        MedganSynthesizer,
        PrivBNSynthesizer,
        TableganSynthesizer,
        CTGANSynthesizer,
        TVAESynthesizer,
        UniformSynthesizer,
        VEEGANSynthesizer,
    ]

    # No datasets argument is passed here, unlike in the previous example.
    scores = benchmark(synthesizers=all_synthesizers)
For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.
SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.
CTGAN is the GAN-based model for synthesizing tabular data presented in the Modeling Tabular data using Conditional GAN paper. It is also developed by MIT's Data to AI Lab and is under active development.