Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
Its main features are:
Highly optimized for speed. Parallelization in Python is hard, and most libraries get it wrong. DataFlow implements highly optimized parallel building blocks that give you an easy interface to parallelize your workload.
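The idea behind such parallel building blocks can be illustrated with a plain-Python sketch. This is not DataFlow's actual implementation, just an illustration of the pattern: fan datapoints out to worker processes and yield results back in order behind a simple iterator interface (`heavy_transform` and `parallel_map` are hypothetical names for this sketch):

```python
from multiprocessing import Pool

def heavy_transform(x):
    # stand-in for an expensive per-datapoint transformation
    return x * x

def parallel_map(iterable, func, num_proc=4):
    """Map `func` over `iterable` with a pool of worker processes,
    yielding results in the original order -- the same idea DataFlow's
    parallel mappers expose behind an easy interface."""
    with Pool(num_proc) as pool:
        # imap overlaps computation across workers but keeps output order
        for result in pool.imap(func, iterable, chunksize=4):
            yield result

results = list(parallel_map(range(10), heavy_transform, num_proc=2))
```

Because the workers are separate processes rather than threads, the transformation is not serialized by the interpreter lock, which is what makes this pattern worthwhile for CPU-heavy preprocessing.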
Written in pure Python. This allows it to be used together with any other Python-based library.
DataFlow is originally part of the tensorpack library and has been through 3 years of active development. Given its independence of the rest of the tensorpack library, and the high demand from users, it is now a separate library whose source code is synced with tensorpack.
Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend you read Why DataFlow?.
```
pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories
```
You may also need to install OpenCV, which is used by many built-in DataFlows.
```python
import dataflow as D
d = D.ILSVRC12('/path/to/imagenet')  # produce [img, label]
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
d = D.MultiProcessMapData(d, num_proc=10,
                          map_func=lambda img, label: other_transform(img, label))
d = D.BatchData(d, 64)
d.reset_state()
for img, label in d:
    # ...
```
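Because a DataFlow is just a pure-Python iterable over datapoints (lists of components), writing a custom source is straightforward and the result composes with any Python library. A minimal sketch that mimics the protocol without requiring the library (the `FakeImageData` class and its toy data are illustrative assumptions, not part of the package):

```python
class FakeImageData:
    """A minimal DataFlow-style source: an iterable over [img, label]
    datapoints. Real sources subclass dataflow.DataFlow, but the core
    protocol is just __iter__ (plus optional __len__ and reset_state)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def reset_state(self):
        # real DataFlows (re)initialize RNGs / worker state here;
        # nothing to do for this toy source
        pass

    def __iter__(self):
        for i in range(self.size):
            img = [[i] * 4] * 4   # toy 4x4 "image"
            label = i % 10
            yield [img, label]

df = FakeImageData(5)
df.reset_state()
datapoints = list(df)
```

Any downstream consumer that understands iterables (NumPy, a training loop, another DataFlow wrapper) can drain such a source directly.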
Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project, where the source code is developed.