
Fluid data pipeline interface #10102

@helinwang

Description


The data pipeline (processing and reading) in a traditional deep learning framework is typically defined in Python scripts that rely on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which works with any executable, including Python scripts.

Read Data from stdin

A typical use case is a Fluid training program that reads mini-batches from stdin:

$ cat ~/dataset/fit_a_line | paddle run train.py

In the above command, ~/dataset/fit_a_line is a file encoded in the Fluid data format. It contains a sequence of training data entries. paddle run train.py runs a Fluid program that trains a neural network, reading the training data from stdin.

Here is an example of train.py:

from paddle import fluid

reader = fluid.reader("/dev/stdin")
with reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  avg_cost = fluid.layers.mean(cost)
  sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
  sgd_optimizer.minimize(avg_cost)

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

The code block inside the with statement is executed once per mini-batch, until EOF is reached on stdin.
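The `with reader.iterate():` block above is declarative: the Fluid runtime re-runs the enclosed program for each mini-batch. As a plain-Python analogy of the loop-until-EOF behavior only (the `Reader` class and the JSON-lines encoding here are hypothetical illustrations, not the actual Fluid data format or API):

```python
import io
import json

class Reader:
    """Toy stand-in for the proposed fluid.reader."""
    def __init__(self, stream):
        self.stream = stream

    def iterate(self):
        # Yield one decoded entry per line until EOF, mirroring how the
        # with-block body would be executed once per mini-batch.
        for line in self.stream:
            if line.strip():
                yield json.loads(line)

# Simulate stdin with an in-memory stream of two entries.
stdin = io.StringIO('{"feature": [1.0], "label": 2.0}\n'
                    '{"feature": [3.0], "label": 4.0}\n')
batches = list(Reader(stdin).iterate())  # stops at EOF
```

When the data arrives on a pipe, EOF is the natural end-of-epoch signal: the upstream process closing its stdout is all the reader needs to observe.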

Output Data to stdout

In the train.py above, the trained parameters are saved to stdout after the training iterations:

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

We can have a test.py that reads the parameters from stdin and tests the model's accuracy:

from paddle import fluid

model_reader = fluid.reader("/dev/stdin")
test_reader = fluid.reader("~/dataset/testset/test")
writer = fluid.writer("/dev/stdout")
fluid.load_parameters(reader=model_reader)
with test_reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  writer.write(cost)

We can use this command to train, test, and pretty-print the test results:

$ cat ~/dataset/fit_a_line | paddle run train.py | paddle run test.py | paddle pretty_print
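Each stage in the pipeline above reads the previous stage's stdout on its stdin, which is what makes the stages freely composable. A minimal sketch of the same composition built programmatically (using `python -c` one-liners to stand in for the proposed `paddle run` stages, which do not exist yet):

```python
import subprocess
import sys

# producer | consumer: the consumer's stdin is wired to the
# producer's stdout, exactly as the shell pipeline does.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('a'); print('b'); print('c')"],
    stdout=subprocess.PIPE)
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"],
    stdin=producer.stdout, stdout=subprocess.PIPE)
producer.stdout.close()  # let the consumer own the read end
out, _ = consumer.communicate()
producer.wait()
```

Because the stages only share file descriptors, any of them can be swapped for a non-Python executable without changing the others.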

Interprocess Communication with Non-Fluid Programs

Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read data from and write data to file descriptors.

In the code below, preprocess.py preprocesses the data from stdin before writing it to stdout:

from paddle import fluid

w = fluid.write_to("/dev/stdout")
w.set_columns("feature", "label")
r = fluid.read_from("/dev/stdin")

for entry in r:
  feature = entry["feature"]
  label = entry["label"]
  # feature and label are both numpy.ndarray
  new_feature = some_preprocessing(feature)
  # new_feature is a numpy.ndarray; it will be serialized to a format
  # that can be deserialized into a fluid.lod_tensor.
  w.write(new_feature, label)

We can then chain it to our training process:

$ cat ~/dataset/fit_a_line | python preprocess.py | paddle run train.py
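This proposal does not spell out the wire encoding of the Fluid data format. Purely for illustration, a pipe-friendly encoding needs some way to delimit serialized entries in a byte stream; a hypothetical length-prefixed framing (not the actual Fluid format) could look like this:

```python
import io
import struct

def write_record(stream, payload):
    # Hypothetical framing: 4-byte little-endian length, then the payload.
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_records(stream):
    # Read frames until EOF; a short read on the header means the
    # upstream process closed the pipe.
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (size,) = struct.unpack("<I", header)
        yield stream.read(size)

buf = io.BytesIO()
write_record(buf, b"serialized-feature")
write_record(buf, b"serialized-label")
buf.seek(0)
records = list(read_records(buf))
```

A framing of this shape is what would let a C++ or shell-based producer interoperate with a Fluid consumer without either side knowing the other's language.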

We will also provide an equivalent C++ API so users can write highly efficient C++ programs that interact with Fluid programs.

Run Fluid Program from Python

All the examples above run Fluid programs from the shell with paddle run *.py. We can run Fluid from Python as well, fully compatible with the Fluid data pipeline interface. Please see examples in #9912.
