BART (Benchmarking Algorithms for data Repairing and Translation) is an error-generation tool for data cleaning applications. It introduces errors into clean databases in order to benchmark data-repairing algorithms. It gives users a high level of control over the error-generation process, and at the same time scales to large databases. This is far from trivial: as we show in our technical papers, the error-generation problem is surprisingly challenging, and in fact NP-complete. To scale to millions of tuples, the system relies on several non-trivial optimizations, including a new symmetry property of data quality constraints.
Additional material about the project (papers and example datasets) can be found at the following address: http://db.unibas.it/projects/bart/
Execute the script ./run.sh <egtask.xml>, for example: ./run.sh misc/resources/employees/employees-dbms-2k-egtask.xml
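When launching several tasks in a row, it can help to check that the task file exists before handing it to the launcher. The following is a minimal sketch, not part of BART: the function name run_egtask and the BART_RUN override variable are hypothetical, and the launcher is assumed to be ./run.sh in the current directory.

```shell
#!/bin/sh
# run_egtask: hypothetical convenience wrapper (not shipped with BART).
#   $1        path to an EGTask .xml file
#   BART_RUN  optional override for the launcher (assumed default: ./run.sh)
run_egtask() {
  task="$1"
  launcher="${BART_RUN:-./run.sh}"

  # Refuse to run without an argument.
  if [ -z "$task" ]; then
    echo "usage: run_egtask <egtask.xml>" >&2
    return 2
  fi

  # Fail early if the task file is missing, before the launcher starts.
  if [ ! -f "$task" ]; then
    echo "EGTask file not found: $task" >&2
    return 1
  fi

  # Hand the task file to the launcher and propagate its exit status.
  "$launcher" "$task"
}
```

For instance, run_egtask misc/resources/employees/employees-dbms-2k-egtask.xml behaves like the direct call above, but returns a non-zero status immediately if the path is mistyped.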
An EGTask is specified in an .xml file (a template is available), with the following sections:
One section specifies the JDBC parameters used to access the DBMS. PostgreSQL and H2 are supported. Data can be loaded into the database automatically from XML and CSV files.
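As a rough illustration of what such JDBC parameters look like, the sketch below builds connection URLs in the standard formats used by the PostgreSQL and H2 JDBC drivers. It does not show the EGTask XML element names, which should be taken from the template. The function name jdbc_url, the database name, and the default host and port are assumptions made for this example.

```shell
#!/bin/sh
# jdbc_url: hypothetical helper that prints a JDBC URL for the two supported DBMSs.
#   $1  DBMS: "postgresql" or "h2"
#   $2  database name (example value, chosen by the user)
#   $3  host (PostgreSQL only; assumed default: localhost)
#   $4  port (PostgreSQL only; assumed default: 5432, the PostgreSQL default)
jdbc_url() {
  case "$1" in
    postgresql)
      # Standard PostgreSQL JDBC URL: jdbc:postgresql://host:port/database
      printf 'jdbc:postgresql://%s:%s/%s\n' "${3:-localhost}" "${4:-5432}" "$2"
      ;;
    h2)
      # In-memory H2 database; H2 also accepts file-based URLs.
      printf 'jdbc:h2:mem:%s\n' "$2"
      ;;
    *)
      echo "unsupported DBMS: $1" >&2
      return 1
      ;;
  esac
}
```

The matching driver classes are org.postgresql.Driver and org.h2.Driver; a user name and password for the database are normally supplied alongside the URL.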
How to evaluate a data-cleaning tool
Run the gfp target of the Bart_Engine project, either from the command line with ant gfp, or using NetBeans (in the Projects window, right-click on build.xml -> Run Target -> Other Targets -> gfp).