P3-BatchRefine provides methods to run OpenRefine in batch mode. It does so by providing a collection of wrappers (called backends) and a distribution layer on top of OpenRefine.
Clients can access the backends in two ways: using a command line client or using an HTTP API based on the Fusepool P3 transformer API. The latter allows BatchRefine to take part in P3 pipelines, where it can be chained with other transformers.
In either case, two things are needed to run BatchRefine:
To try BatchRefine right away, use the pre-built Docker image:
docker run --rm -it -p 8310:8310 fusepool/p3-batchrefine
This will start the P3 BatchRefine transformer with the default configuration, which can be accessed as follows:
curl -XPOST -H 'Content-Type:text/csv' --data-binary @input.csv 'localhost:8310/?refinejson=http://url.to/transform.json'
Building BatchRefine from sources requires Maven 3 and Apache Ant (for building OpenRefine). The procedure, which is somewhat complex because OpenRefine is not meant to be used as a library, is as follows. In a clean folder:
Download the OpenRefine 2.6-beta.1 source distribution from:
Unzip and untar it, then build OpenRefine together with the server and web app JARs by running:
ant build jar_server jar_webapp
Switch to the ./extensions folder under the OpenRefine root and then download the OpenRefine RDF extension alpha 0.9.0 source
Unzip and untar it, rename the extracted folder to rdf-extension, and build it as follows:
mv grefine-rdf-extension-0.9.0 rdf-extension
cd rdf-extension
JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF-8' ant build
After that, switch back to the OpenRefine root and start it. A running instance is required for the tests that BatchRefine runs during the build.
Download BatchRefine from:
into a sibling folder to OpenRefine (i.e. both OpenRefine and BatchRefine should share the same parent folder). As usual, unzip and untar. Switch to the p3-batchrefine-v1.x.x folder, and run:
./bin/refine-import.sh
mvn package
The JAR for starting the P3 transformer will be located under:
whereas the JAR for starting the command line client will be under:
This section describes how to run the tools; for more details, refer to the Usage section.
Run the Command Line Tool
./bin/batchrefine [--verbose] BACKEND_TYPE [backend_specific_options] INPUTFILE TRANSFORM [OUTPUTFILE]
If no OUTPUTFILE is specified, the output is written to stdout.
Available backends:
remote - simple HTTP client that connects to an OpenRefine instance
split - distributed backend able to connect to multiple OpenRefine instances and improve performance by splitting the input file
embedded - built-in OpenRefine that lets you run transforms without starting an external OpenRefine instance (currently has limited functionality)
spark - distributed backend based on Apache Spark aimed at very large workloads (currently has limited functionality)
To list the backend-specific options, run:
./bin/batchrefine BACKEND_TYPE --help
Run the P3 Transformer
./bin/transformer [TRANSFORMER_OPTIONS] BACKEND_TYPE [backend_specific_options]
-v -- verbose logging
-p [PORT] -- port on which the transformer listens (defaults to 8310)
-t [sync|async] -- transformer type: synchronous or asynchronous (defaults to sync)
Available backends for the transformer are: remote, split, spark
backend_specific_options are the same as for the command line client and can be listed with the --help option; alternatively, consult the Usage section.
To start the most common configuration of the transformer (running synchronously on port 8310 and connecting to a locally running instance of OpenRefine):
./bin/transformer remote
# which is equivalent to:
./bin/transformer -v -t sync -p 8310 remote -l localhost:3333
Unfortunately, the command line tool has to be built from sources. Read the section on building BatchRefine from sources for instructions on how to do it.
The HTTP API is convenient for integrating BatchRefine as a service, but clumsy for manual usage. The command line tool works better in these cases, as you can simply do:
./bin/batchrefine remote input.csv transform.json > output.csv
where, as before, input.csv is the input file, transform.json is the transform script, and output.csv is the output file to which the transformed data is written.
We ship a prepackaged script to start the command line tool under
./bin. We will show an example using the embedded backend so that
you do not need to start OpenRefine to actually use it.
./bin/batchrefine embedded input.csv transform.json
This will produce a CSV file on stdout with the transform applied to it.
The embedded engine cannot currently do reconciliation, and extensions require customization to work (i.e. the RDF extension won't work out of the box). Further, it is likely that it has to be altered or rewritten to work with newer versions of OpenRefine.
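Because the embedded backend needs no running OpenRefine instance, it lends itself to applying one transform to several files in a loop. The following is a minimal sketch, assuming it is run from the BatchRefine source root. The function name `refine_all` and the `BATCHREFINE` override variable are hypothetical conveniences for illustration, not part of the tool.

```shell
# Hypothetical helper, not part of BatchRefine.
# Usage: refine_all TRANSFORM OUTDIR INPUTFILE...
# Applies TRANSFORM to every INPUTFILE with the embedded backend and writes
# each result to OUTDIR under the same file name.
refine_all() {
  transform="$1"
  outdir="$2"
  shift 2
  mkdir -p "$outdir"
  for input in "$@"; do
    # BATCHREFINE is an assumed override; it defaults to the shipped script.
    "${BATCHREFINE:-./bin/batchrefine}" embedded "$input" "$transform" \
      > "$outdir/$(basename "$input")" || return 1
  done
}
```

For example, `refine_all transform.json out data/*.csv` would write one transformed file per input into `out/`. The loop stops at the first file that fails, so a partial run is easy to spot.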
If you get JSON exceptions such as org.json.JSONException, check the size of your input JSON file. BatchRefine does not accept large input JSON files, and you may have to shrink the file to a few hundred kB to get rid of the exception. This can be done by not selecting the whole history in OpenRefine, since the full history can make the configuration really big (a few megabytes).
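A quick size check before running a transform can catch this early. The sketch below is only an illustration: the function name `check_transform_size` is hypothetical, and the default threshold of 512000 bytes is an assumption loosely based on the "few hundred kB" guideline above, not a documented BatchRefine limit.

```shell
# Hypothetical helper, not part of BatchRefine.
# Usage: check_transform_size FILE [LIMIT_BYTES]
# Returns 0 if FILE is at most LIMIT_BYTES, 1 (with a warning) otherwise.
check_transform_size() {
  file="$1"
  limit="${2:-512000}"   # assumed threshold, not an official limit
  # wc -c may pad its output with spaces on some systems, so strip them.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$limit" ]; then
    echo "warning: $file is $size bytes (limit $limit)" >&2
    return 1
  fi
  return 0
}
```

It can be chained in front of a run, for instance `check_transform_size transform.json && ./bin/batchrefine embedded input.csv transform.json`.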
The command line tool can also act as a direct client to a running
OpenRefine instance. If you have OpenRefine running on
refine.example.com:3333, you can use the command line client as follows:
./bin/batchrefine remote -l refine.example.com:3333 input.csv transform.json
The command line tool can also split a large file for you and submit it to multiple OpenRefine instances. For example, if you have two OpenRefine instances and want to split your file in half:
./bin/batchrefine split -l refine.example.com:3333,refine1.example.com:3333 -s CHUNK:2 input.csv transform.json
The split backend will split the input file into 2 chunks, upload them to the available OpenRefine instances, and reassemble the result.
To get the list of available options, use
./bin/batchrefine split --help
--help : Prints usage information
-c (--config) config.properties : Load batchrefine config from properties file
-f (--format) [csv | rdf | turtle] : The format in which to output the transformed data
-h (--hosts) localhost : OpenRefine instances hosts
-s (--split) [LINE:int | CHUNK:int] : Set default split logic
Two split strategies are supported:
CHUNK:2 will split a file into 2 chunks, as in the example above.
LINE:30,50,80 will split a file into 4 pieces at exactly the specified lines.
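To see why three line numbers yield four pieces, the cut points can be turned into line ranges. This sketch is purely illustrative: the function name `line_split_ranges` is hypothetical, and it assumes each cut falls after the listed line and ignores any special handling of a CSV header row. Neither detail is confirmed here, so BatchRefine may place the boundaries differently.

```shell
# Illustration only, not BatchRefine's implementation.
# Usage: line_split_ranges TOTAL_LINES CUT...
# Prints one "first-last" line range per piece, assuming each piece ends
# at the listed line number (an assumption).
line_split_ranges() {
  total="$1"
  shift
  start=1
  for cut in "$@"; do
    echo "${start}-${cut}"
    start=$((cut + 1))
  done
  # Whatever remains after the last cut forms the final piece.
  echo "${start}-${total}"
}

line_split_ranges 100 30 50 80
# prints:
# 1-30
# 31-50
# 51-80
# 81-100
```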
The BatchRefine P3 transformer wraps (multiple instances of) OpenRefine under the Fusepool P3 HTTP API. We will show how to build a transformer that operates over a single instance, running locally.
use the Dockerfile we provide;
use our wrapper script. At the BatchRefine source root, run:
cd docker
./batchrefine-docker.sh bootstrap
After running the bootstrap step, you just have to run:
This will expose a synchronous BatchRefine P3 transformer on port 8310. To access the transformer, you have to make a POST request to it.
For more information regarding Docker, refer to the Docker README.
The Docker image provides a running OpenRefine instance together with the transformer, so you don't have to run your own.
./bin/transformer -v -t sync remote -l refine.example.com:3333
This will start a synchronous P3 transformer that connects to the specified OpenRefine instance.
If no URI is specified, defaults to:
As per the P3 transformer API, the input file goes in the body of the POST request, whereas the transform script is passed as a URI in a query parameter, called refinejson in our case. Assuming our input file is called input.csv and is available locally, and our transform script is called transform.json and is available at http://www.example.org/transform.json, we could make a request like:
curl -XPOST --data-binary @input.csv -H 'Content-Type:text/csv' -H 'Accept:text/csv' 'http://localhost:8310?refinejson=http://www.example.org/transform.json'
to which the transformer will reply with a CSV file that has been transformed according to what is described in transform.json.
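For repeated use, the request can be wrapped in a small shell function. This is a minimal sketch: the names `transformer_url` and `refine_post` and the `TRANSFORMER` variable are hypothetical, it assumes curl is installed, and it appends the transform URI without URL encoding, the same way the curl examples in this README do.

```shell
# Hypothetical helpers, not part of BatchRefine.

# Build the request URL from the transformer base URL and the transform URI.
# A single trailing slash on the base URL is dropped; the transform URI is
# appended as-is (no URL encoding).
transformer_url() {
  printf '%s/?refinejson=%s\n' "${1%/}" "$2"
}

# Usage: refine_post INPUT.csv TRANSFORM_URI OUTPUT.csv
# POSTs a local CSV file to the transformer and saves the reply.
# TRANSFORMER is an assumed override for the transformer address.
refine_post() {
  input="$1"
  transform_uri="$2"
  output="$3"
  base="${TRANSFORMER:-http://localhost:8310}"
  # --fail makes curl exit non-zero on HTTP error responses.
  curl --fail -XPOST \
    -H 'Content-Type:text/csv' -H 'Accept:text/csv' \
    --data-binary "@$input" \
    -o "$output" \
    "$(transformer_url "$base" "$transform_uri")"
}
```

With a transformer running locally, `refine_post input.csv http://www.example.org/transform.json output.csv` would save the transformed data to output.csv and return a non-zero status if the transformer reports an error.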
NB: Although transform scripts can be taken from local URIs such as file:///tmp/transform.json, BatchRefine won't be able to access them when running inside Docker. If you want to post file URIs, it's best to build and run the transformer from sources (see the section on building BatchRefine from sources).
This work is partially funded by the Fusepool P3 project, under FP7 grant 609696.