Tensorpack DataFlow

Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.

Its main features are:

  1. Highly-optimized for speed. Parallelization in Python is hard and most libraries do it wrong. DataFlow implements highly-optimized parallel building blocks which give you an easy interface to parallelize your workload (see the sketch after this list).

  2. Written in pure Python. This allows it to be used together with any other Python-based library.
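
For example, here is a minimal sketch of one such building block, MultiProcessRunnerZMQ, which runs a DataFlow in forked worker processes. The DataFromList source is only a toy stand-in for a real dataset:

import dataflow as D

# a toy in-memory source; each element of the list is one datapoint
df = D.DataFromList([[i] for i in range(1000)], shuffle=False)
# fork 4 worker processes, each running its own copy of the DataFlow,
# and collect their datapoints over ZMQ
df = D.MultiProcessRunnerZMQ(df, num_proc=4)
df.reset_state()
for dp in df:
    pass  # datapoints arrive from the workers; order is not guaranteed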

DataFlow was originally part of the tensorpack library and has been through three years of active development. Given its independence from the rest of tensorpack, and the high demand from users, it is now a separate library whose source code is synced with tensorpack.

Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend reading Why DataFlow?

Install:

pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories

You may also need to install OpenCV, which is used by many built-in DataFlows.
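
If you do, the unofficial opencv-python wheel is one common way to get it (an assumption about your setup; any working OpenCV installation is fine):

pip install opencv-python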

Examples:

import dataflow as D
d = D.ILSVRC12('/path/to/imagenet')  # produce [img, label]
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
d = D.MultiProcessMapData(d, num_proc=10,  # run the mapping in 10 worker processes
                          map_func=lambda dp: other_transform(dp[0], dp[1]))  # map_func takes a datapoint and returns a new one
d = D.BatchData(d, 64)
d.reset_state()
for img, label in d:
  pass  # ...
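
The example above relies on the DataFlow interface: a DataFlow is an iterable over datapoints (lists of components), and iteration starts after reset_state() is called. As a minimal sketch, a custom source can be written by subclassing DataFlow (the SquareNumbers class below is hypothetical, purely for illustration):

import dataflow as D

class SquareNumbers(D.DataFlow):
    """A toy DataFlow yielding datapoints [x, x ** 2]."""
    def __iter__(self):
        for x in range(100):
            yield [x, x ** 2]

    def __len__(self):  # optional; lets wrappers know the size
        return 100

df = D.BatchData(SquareNumbers(), 10)  # built-in wrappers compose with it
df.reset_state()
for xs, squares in df:
    print(xs.shape)  # BatchData stacks components into numpy arrays: (10,)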

Documentation:

Tutorials:

  1. Basics
  2. Why DataFlow?
  3. Write a DataFlow
  4. Parallel DataFlow
  5. Efficient DataFlow

APIs:

  1. Built-in DataFlows
  2. Built-in Datasets

Support & Contributing

Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project, where the source code is developed.