Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
Its main features are:
Highly optimized for speed. Parallelization in Python is hard, and most libraries get it wrong. DataFlow implements highly optimized parallel building blocks that give you an easy interface to parallelize your workload.
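The idea behind such parallel building blocks can be illustrated with a plain-Python sketch. This is not DataFlow's actual implementation, just an illustration of the pattern: fan datapoints out to worker processes and yield results back in order behind a simple iterator interface (`heavy_transform` and `parallel_map` are hypothetical names for this sketch):

```python
from multiprocessing import Pool

def heavy_transform(x):
    # stand-in for an expensive per-datapoint transformation
    return x * x

def parallel_map(iterable, func, num_proc=4):
    """Map `func` over `iterable` with a pool of worker processes,
    yielding results in the original order -- the same idea DataFlow's
    parallel mappers expose behind an easy interface."""
    with Pool(num_proc) as pool:
        # imap overlaps computation across workers but keeps output order
        for result in pool.imap(func, iterable, chunksize=4):
            yield result

results = list(parallel_map(range(10), heavy_transform, num_proc=2))
```

Because the workers are separate processes rather than threads, the transformation is not serialized by the interpreter lock, which is what makes this pattern worthwhile for CPU-heavy preprocessing.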
Written in pure Python. This allows it to be used together with any other Python-based library.
DataFlow is originally part of the tensorpack library and has been through 3 years of active development. Given its independence of the rest of the tensorpack library, and the high demand from users, it is now a separate library whose source code is synced with tensorpack.
Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend you read Why DataFlow?.
```
pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories
```
You may also need to install OpenCV, which is used by many built-in DataFlows.
```python
import dataflow as D
d = D.ILSVRC12('/path/to/imagenet')  # produce [img, label]
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
d = D.MultiProcessMapData(d, num_proc=10,
                          map_func=lambda img, label: other_transform(img, label))
d = D.BatchData(d, 64)
d.reset_state()
for img, label in d:
    # ...
```
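Because a DataFlow is just a pure-Python iterable over datapoints (lists of components), writing a custom source is straightforward and the result composes with any Python library. A minimal sketch that mimics the protocol without requiring the library (the `FakeImageData` class and its toy data are illustrative assumptions, not part of the package):

```python
class FakeImageData:
    """A minimal DataFlow-style source: an iterable over [img, label]
    datapoints. Real sources subclass dataflow.DataFlow, but the core
    protocol is just __iter__ (plus optional __len__ and reset_state)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def reset_state(self):
        # real DataFlows (re)initialize RNGs / worker state here;
        # nothing to do for this toy source
        pass

    def __iter__(self):
        for i in range(self.size):
            img = [[i] * 4] * 4   # toy 4x4 "image"
            label = i % 10
            yield [img, label]

df = FakeImageData(5)
df.reset_state()
datapoints = list(df)
```

Any downstream consumer that understands iterables (NumPy, a training loop, another DataFlow wrapper) can drain such a source directly.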
Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project, where the source code is developed.