
Fluid data pipeline interface #10102

@helinwang

Description


The data pipeline (processing and reading) in a traditional deep learning framework is typically defined in Python scripts that rely on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which works with any executable, including Python scripts.

Read Data from stdin

A typical use case is a Fluid training program that reads mini-batches from stdin:

$ cat ~/dataset/fit_a_line | paddle run train.py

In the above command, ~/dataset/fit_a_line is a file encoded in the Fluid data format. It contains a sequence of training data entries. paddle run train.py runs a Fluid program that trains a neural network, reading the training data from stdin.

Here is an example of train.py:

from paddle import fluid

reader = fluid.reader("/dev/stdin")
with reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  avg_cost = fluid.layers.mean(cost)
  sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
  sgd_optimizer.minimize(avg_cost)

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

The code block inside the with statement is executed once per mini-batch, until EOF is reached on stdin.
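The `with reader.iterate():` block above is declarative: the Fluid runtime re-runs the enclosed program for each mini-batch. As a plain-Python analogy of the loop-until-EOF behavior only (the `Reader` class and the JSON-lines encoding here are hypothetical illustrations, not the actual Fluid data format or API):

```python
import io
import json

class Reader:
    """Toy stand-in for the proposed fluid.reader."""
    def __init__(self, stream):
        self.stream = stream

    def iterate(self):
        # Yield one decoded entry per line until EOF, mirroring how the
        # with-block body would be executed once per mini-batch.
        for line in self.stream:
            if line.strip():
                yield json.loads(line)

# Simulate stdin with an in-memory stream of two entries.
stdin = io.StringIO('{"feature": [1.0], "label": 2.0}\n'
                    '{"feature": [3.0], "label": 4.0}\n')
batches = list(Reader(stdin).iterate())  # stops at EOF
```

When the data arrives on a pipe, EOF is the natural end-of-epoch signal: the upstream process closing its stdout is all the reader needs to observe.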

Output Data to stdout

In the train.py above, the trained parameters are saved to stdout after the training iterations:

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

We can have a test.py that reads the parameters from stdin and tests the model's accuracy:

from paddle import fluid

model_reader = fluid.reader("/dev/stdin")
test_reader = fluid.reader("~/dataset/testset/test")
writer = fluid.writer("/dev/stdout")
fluid.load_parameters(reader=model_reader)
with test_reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  writer.write(cost)

We can use this command to train, test, and pretty-print the test results:

$ cat ~/dataset/fit_a_line | paddle run train.py | paddle run test.py | paddle pretty_print
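Each stage in the pipeline above reads the previous stage's stdout on its stdin, which is what makes the stages freely composable. A minimal sketch of the same composition built programmatically (using `python -c` one-liners to stand in for the proposed `paddle run` stages, which do not exist yet):

```python
import subprocess
import sys

# producer | consumer: the consumer's stdin is wired to the
# producer's stdout, exactly as the shell pipeline does.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('a'); print('b'); print('c')"],
    stdout=subprocess.PIPE)
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"],
    stdin=producer.stdout, stdout=subprocess.PIPE)
producer.stdout.close()  # let the consumer own the read end
out, _ = consumer.communicate()
producer.wait()
```

Because the stages only share file descriptors, any of them can be swapped for a non-Python executable without changing the others.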

Interprocess Communication with Non-Fluid Programs

Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read data from and write data to file descriptors.

In the code below, preprocess.py preprocesses the data from stdin before writing it to stdout:

from paddle import fluid

w = fluid.write_to("/dev/stdout")
w.set_columns("feature", "label")
r = fluid.read_from("/dev/stdin")

for entry in r:
  feature = entry["feature"]
  label = entry["label"]
  # feature and label are both numpy.ndarray
  new_feature = some_preprocessing(feature)
  # new_feature is a numpy.ndarray; it will be serialized to a format
  # that can be deserialized into a fluid.lod_tensor.
  w.write(new_feature, label)

We can then chain it to our training process:

$ cat ~/dataset/fit_a_line | python preprocess.py | paddle run train.py
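This proposal does not spell out the wire encoding of the Fluid data format. Purely for illustration, a pipe-friendly encoding needs some way to delimit serialized entries in a byte stream; a hypothetical length-prefixed framing (not the actual Fluid format) could look like this:

```python
import io
import struct

def write_record(stream, payload):
    # Hypothetical framing: 4-byte little-endian length, then the payload.
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_records(stream):
    # Read frames until EOF; a short read on the header means the
    # upstream process closed the pipe.
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (size,) = struct.unpack("<I", header)
        yield stream.read(size)

buf = io.BytesIO()
write_record(buf, b"serialized-feature")
write_record(buf, b"serialized-label")
buf.seek(0)
records = list(read_records(buf))
```

A framing of this shape is what would let a C++ or shell-based producer interoperate with a Fluid consumer without either side knowing the other's language.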

We will also provide an equivalent C++ API so users can write highly efficient C++ programs that interact with Fluid programs.

Run Fluid Program from Python

All the examples above run Fluid programs from the shell with paddle run *.py. We can run Fluid from Python as well, fully compatible with the Fluid data pipeline interface. Please see examples in #9912.
