Efficient Data Loading Pipeline in Pure Python

tensorpack/dataflow

Tensorpack DataFlow

Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.

Its main features are:

  1. Highly optimized for speed. Parallelization in Python is hard, and most libraries do it wrong. DataFlow implements highly optimized parallel building blocks that give you an easy interface for parallelizing your workload.

  2. Written in pure Python. This allows it to be used together with any other Python-based library.
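
To give a feel for what a "parallel building block" does, here is a minimal sketch using only the standard library. It is not DataFlow's implementation: `parallel_map`, `num_workers` and `buffer_size` are hypothetical names chosen for illustration, and it uses threads rather than the worker processes a real data loader would typically need for CPU-bound work.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def parallel_map(func, iterable, num_workers=4, buffer_size=16):
    """Yield func(x) for each x in iterable, in order, using worker threads.

    At most `buffer_size` items are in flight, so a large or infinite
    source is never fully materialized.
    """
    it = iter(iterable)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Prime the buffer with the first `buffer_size` tasks.
        pending = deque(pool.submit(func, x) for x in islice(it, buffer_size))
        while pending:
            result = pending.popleft().result()  # wait for the oldest task
            # Refill by one so the workers stay busy while the caller consumes.
            for x in islice(it, 1):
                pending.append(pool.submit(func, x))
            yield result


def square(x):
    return x * x


squares = list(parallel_map(square, range(10)))
```

The bounded buffer is the important design choice here: it keeps memory use constant and lets the consumer overlap its own work with the mapping function.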

DataFlow was originally part of the tensorpack library and has been through many years of polishing. Because it is independent of the rest of tensorpack, it is now a separate library whose source code is synced with tensorpack. Please use tensorpack issues for support.

Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend reading Why DataFlow?.

Install:

pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories

You may also need to install OpenCV, which is used by many built-in DataFlows.

Examples:

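import dataflow as D

# some_transform and other_transform are placeholders for your own functions.
d = D.ILSVRC12('/path/to/imagenet', 'train')  # produce [img, label]
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
# map_func receives one whole datapoint, here [img, label].
d = D.MultiProcessMapData(d, num_proc=10, map_func=lambda dp: other_transform(dp[0], dp[1]))
d = D.BatchData(d, 64)
d.reset_state()
for img, label in d:
    pass  # ...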
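
The example above builds a pipeline by wrapping one iterable in another. The following sketch mimics that composition pattern with plain generators so it can run without any dataset. It is only an illustration under assumptions: `source`, `map_component` and `batch` are hypothetical helpers, not DataFlow classes, and they do not reproduce the library's behavior in detail.

```python
from itertools import islice


def source(n):
    """Hypothetical stand-in for a dataset: yields [value, label] datapoints."""
    for i in range(n):
        yield [float(i), i % 2]


def map_component(ds, func, index):
    """Apply func to one component of each datapoint, leaving the rest as-is."""
    for dp in ds:
        dp = list(dp)  # copy so the upstream datapoint is not mutated
        dp[index] = func(dp[index])
        yield dp


def batch(ds, batch_size):
    """Group consecutive datapoints into per-component lists.

    A trailing incomplete batch is dropped in this sketch.
    """
    it = iter(ds)
    while True:
        chunk = list(islice(it, batch_size))
        if len(chunk) < batch_size:
            return
        # Transpose [[v0, l0], [v1, l1], ...] into [[v0, v1, ...], [l0, l1, ...]].
        yield [list(component) for component in zip(*chunk)]


pipeline = batch(map_component(source(10), lambda v: v * 2.0, index=0), 4)
batches = list(pipeline)
```

Each stage only needs to be iterable, which is why stages can be stacked in any order and why the same pattern works with any Python-based library.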

Documentation:

Tutorials:

  1. Basics
  2. Why DataFlow?
  3. Write a DataFlow
  4. Parallel DataFlow
  5. Efficient DataFlow

APIs:

  1. Built-in DataFlows
  2. Built-in Datasets

Support & Contributing

Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project where the source code is developed.
