Description
The data pipeline (processing and reading) in a traditional deep learning framework is typically defined in Python scripts, relying on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which works with any executable, including Python scripts.
Read Data from stdin
A typical use case is a Fluid training program that reads mini-batches from stdin:
$ cat ~/dataset/fit_a_line | paddle run train.py

In the above command, ~/dataset/fit_a_line is a file encoded in the Fluid data format. It contains a sequence of training data entries. paddle run train.py runs a Fluid program that trains a neural network, reading the training data from stdin.
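To make the input side concrete, here is a minimal sketch of how such a dataset file might be produced, assuming the fluid.write_to API proposed in the interprocess section below; load_raw_data is a hypothetical placeholder for whatever loads the raw samples:

import numpy as np
from paddle import fluid

# Sketch only: write entries in the Fluid data format to a file,
# using the fluid.write_to API described later in this document.
w = fluid.write_to("~/dataset/fit_a_line")
w.set_columns("feature", "label")
for feature, label in load_raw_data():  # load_raw_data is a placeholder
    # Each column is a numpy.ndarray; the writer serializes one entry per call.
    w.write(np.asarray(feature), np.asarray(label))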
Here is an example of train.py:
from paddle import fluid

reader = fluid.reader("/dev/stdin")
with reader.iterate():
    x = fluid.layers.data(name="feature")
    label = fluid.layers.data(name="label")
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=label)
    avg_cost = fluid.layers.mean(cost)
    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(avg_cost)

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

The code block inside the with statement is executed once for every mini-batch, until EOF is read from stdin.
Output Data to stdout
In the train.py above, the trained parameters are saved to stdout after the training iterations:
writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

We can have a test.py that reads the parameters from stdin and tests the model's accuracy:
from paddle import fluid

model_reader = fluid.reader("/dev/stdin")
test_reader = fluid.reader("~/dataset/testset/test")
writer = fluid.writer("/dev/stdout")

fluid.load_parameters(reader=model_reader)
with test_reader.iterate():
    x = fluid.layers.data(name="feature")
    label = fluid.layers.data(name="label")
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=label)
    writer.write(cost)

We can use this command to train, test, and pretty-print the test result:
$ cat ~/dataset/fit_a_line | paddle run train.py | paddle run test.py | paddle pretty_print

Interprocess Communication with Non-Fluid Programs
Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read data from and write data to file descriptors.
In the code below, preprocess.py preprocesses the data from stdin before writing it to stdout:
from paddle import fluid

w = fluid.write_to("/dev/stdout")
w.set_columns("feature", "label")
r = fluid.read_from("/dev/stdin")
for entry in r:
    feature = entry["feature"]
    label = entry["label"]
    # feature and label are both numpy.ndarray
    new_feature = some_preprocessing(feature)
    # new_feature is a numpy.ndarray; it will be serialized to a format
    # that can be deserialized to fluid.lod_tensor.
    w.write(new_feature, label)

We can then chain it into our training pipeline:
$ cat ~/dataset/fit_a_line | python preprocess.py | paddle run train.py

We will provide an equivalent C++ API so that users can write highly efficient C++ programs that interact with Fluid programs.
Run Fluid Program from Python
All the above examples run Fluid programs from the shell with paddle run *.py. We can run Fluid from Python as well, fully compatible with the Fluid data pipeline interface. Please see the examples in #9912.
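As a rough illustration only, driving the same pipeline from Python might look like the sketch below; fluid.run and its arguments are assumptions for this sketch, not the actual API, which is defined in the examples in #9912:

import os
from paddle import fluid

# Hypothetical sketch: drive train.py from Python instead of `paddle run`,
# feeding it mini-batches from the same Fluid-format file via its stdin.
# fluid.run is an assumed entry point; see #9912 for the real examples.
data_path = os.path.expanduser("~/dataset/fit_a_line")
with open(data_path, "rb") as data:
    fluid.run("train.py", stdin=data)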