
Plan to develop the inference library of fluid #7145

Closed

Description

@Xreki

Based on my experience implementing a simple C++ inference example for fluid in #7097, I have gained a basic understanding of fluid's training and inference process.

To implement the inference library for fluid, we need to discuss the following things:

Framework Design

In fluid, the inference process is composed of 2 phases:

  • Creation phase. In fluid, two ProgramDescs are needed for inference: one defines the inference network, namely inference_program; the other defines how and where to load the parameters, namely load_program.
    • Basically, the protobuf string of the inference network, feed_var_names and fetch_var_names are needed to create the two ProgramDescs.
    • No Tensors or models are created in this phase; inference_program and load_program just hold the protobuf messages of the inference network and of the network initialization, respectively.
  • Execution phase. In fluid, users can switch among different execution environments flexibly. There are two steps:
    • Configuration step, allowing users to set different configurations, such as different devices (CPU or GPU) and different runtime settings (multi-threading or multi-device). In fact, this step initializes the execution environment for the inference network.

      • Once the configuration step is completed, users can run inference many times.
      • Users can initialize several different execution environments for the same inference network.
      • The loading of parameters is also done in this step, by running load_program.
    • Running step, allowing users to feed different data, do inference and fetch the predicted results.

      • Both synchronous and asynchronous execution should be supported.

      (Figure: inference workflow)

As a result, there should be at least three key concepts in fluid's C++ API:

  • InferenceDesc, to hold the handles of inference_program and load_program; it can be initialized from a file or from a buffer.

  • Tensor, an easy-to-use data structure for users. Fluid's Tensor and LoDTensor are too complicated for users, and we do not need delayed memory allocation in this structure. Alternatively, Tensor and LoDTensor could be used directly in user code.

  • Execution, to hold the configuration of the execution environment.

    [NEED MORE DETAIL]
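
To make the discussion concrete, here is a minimal header-style sketch of what these three concepts could look like. All method and member names (FromFile, FromBuffer, ExecutionConfig, Run, ...) are assumptions for illustration, not a proposed final interface.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Easy-to-use tensor: a plain shape plus a contiguous buffer, with no LoD
// and no delayed memory allocation. Only float data is shown to keep the
// sketch short.
struct Tensor {
  std::vector<int64_t> shape;
  std::vector<float> data;
};

// Holds inference_program and load_program (creation phase). No tensors or
// parameters are created here; it only wraps the protobuf messages.
class InferenceDesc {
 public:
  static InferenceDesc FromFile(const std::string& path);      // assumed name
  static InferenceDesc FromBuffer(const std::string& buffer);  // assumed name
};

// Runtime settings chosen in the configuration step.
struct ExecutionConfig {
  enum Device { kCPU, kGPU };
  Device device = kCPU;
  int device_id = 0;
  int num_threads = 1;
};

// One execution environment. Constructing it would run load_program once to
// load the parameters; Run() can then be called many times with different
// feeds, and several Execution objects can share one InferenceDesc.
class Execution {
 public:
  Execution(const InferenceDesc& desc, const ExecutionConfig& config);
  void Run(const std::vector<Tensor>& feeds, std::vector<Tensor>* fetches);
};
```

Under this sketch, constructing several Execution objects from one InferenceDesc covers the "several different execution environments" point above, and an asynchronous variant of Run could be added later for the asynchronous case.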

Refine the Storing Format of Inference Network

Currently, models trained by the v2 API cannot be read by the C-API directly. Extra steps are needed:

  • Remove the label, cost and evaluator layers from the network manually to get an inference network
  • Use dumpy_config.py to get the serialized config, or use merge_model.py to get a merged model (the serialized config and all parameter files are merged into a single file)

In fluid, we hope that the model stored during training can be used in C++ inference code directly. Currently, there are a couple of interfaces, fluid.io.save_inference_model and fluid.io.load_inference_model, which are specially designed for inference. However, the storing format needs to be refined.

  • fluid.io.save_inference_model uses pickle.dump to store the program_desc_str, feed_var_names and fetch_var_names information. It is difficult to support pickle in C++ code without third-party tools (like PicklingTools). It would be better to design another way to save the inference model, for example:
    • design a new protobuf data structure for inference
    • or make feed_var_names and fetch_var_names members of ProgramDesc
    • or store feed_var_names, fetch_var_names and program_desc_str sequentially and use some keywords to separate them (a rough sketch of such a single-file format follows this list)
  • fluid.io.save_inference_model saves the serialized protobuf message of the network and all parameter variables into separate files in the same directory. However, many users may want to save all the parameters into a single file and initialize the inference model from a buffer. We should make it possible to merge all the parameter files into one; this may require modifying some operators, such as load_op.
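
As a rough illustration of the last sub-option above, the sketch below assumes a hypothetical single-file layout in which feed_var_names, fetch_var_names and program_desc_str are written one after another as length-prefixed records (a small variation on keyword separators that stays safe for the binary program_desc_str); none of the file names or helpers are part of fluid.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Write one record: an 8-byte length followed by the raw bytes.
void WriteRecord(std::ofstream& out, const std::string& data) {
  uint64_t size = data.size();
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));
  out.write(data.data(), static_cast<std::streamsize>(size));
}

// Read one record back, in the same order it was written.
std::string ReadRecord(std::ifstream& in) {
  uint64_t size = 0;
  in.read(reinterpret_cast<char*>(&size), sizeof(size));
  std::string data(size, '\0');
  in.read(&data[0], static_cast<std::streamsize>(size));
  return data;
}

int main() {
  // Hypothetical contents produced by the Python side.
  std::string feed_var_names = "image";        // several names could be joined by '\n'
  std::string fetch_var_names = "fc_2.tmp_2";  // made-up variable name
  std::string program_desc_str = "<serialized ProgramDesc bytes>";

  {
    std::ofstream out("inference.model", std::ios::binary);
    WriteRecord(out, feed_var_names);
    WriteRecord(out, fetch_var_names);
    WriteRecord(out, program_desc_str);
  }

  std::ifstream in("inference.model", std::ios::binary);
  std::string feeds = ReadRecord(in);
  std::string fetches = ReadRecord(in);
  std::string program = ReadRecord(in);
  return feeds == feed_var_names ? 0 : 1;
}
```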

Besides, the storing formats for training and for inference are different: users need to save two copies if they want to use the results both for fine-tuning and for inference. I am not sure whether it would be better to unify the storing formats.

Optimize the Inference Program

  • As shown in #7097, many Variables, such as velocity_*, learning_rate_* and *@GRAD, are created but never referenced by any op, so the inference program contains many unreferenced Variables. Prune is called to remove the unreachable operators in the inference program; it should remove the unreferenced Variables at the same time (a minimal sketch of the idea follows).
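
A minimal sketch of that idea, using simplified stand-in structs rather than fluid's real BlockDesc/OpDesc classes: after the unreachable operators are pruned, collect every variable name still referenced by some op and drop the rest.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Simplified stand-ins for fluid's OpDesc and BlockDesc (not the real classes).
struct OpDesc {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

struct BlockDesc {
  std::vector<OpDesc> ops;        // operators left after pruning
  std::vector<std::string> vars;  // names of all declared variables
};

// Keep only the variables referenced by at least one remaining operator.
void PruneUnreferencedVars(BlockDesc* block) {
  std::unordered_set<std::string> referenced;
  for (const auto& op : block->ops) {
    referenced.insert(op.inputs.begin(), op.inputs.end());
    referenced.insert(op.outputs.begin(), op.outputs.end());
  }
  std::vector<std::string> kept;
  for (const auto& name : block->vars) {
    if (referenced.count(name) > 0) kept.push_back(name);
  }
  block->vars.swap(kept);
}
```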

Compiling Aspects

  • All of fluid's core C++ code should be packaged into a single static or shared library, named something like libpaddle_fluid.a or libpaddle_fluid.so.
    • libpaddle_fluid.so should link all the dependent libraries but limit the exported symbol table at the same time, as the C-API does (a minimal sketch follows this list).
    • libpaddle_fluid.a should not contain any binaries of third-party libraries (gflags, glog, ...).
  • Maybe we need to support the compiler (gcc 4.8.2) that is commonly deployed on our development servers.
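
For the symbol-table point, one common technique (a sketch assuming GCC/Clang with -fvisibility=hidden added to the build flags; the macro and function names below are made up) is to hide everything by default and mark only the public inference API as visible, similar in effect to a linker export map:

```cpp
// Compile the library sources with -fvisibility=hidden so that only symbols
// explicitly marked below are exported from libpaddle_fluid.so.
#if defined(__GNUC__) || defined(__clang__)
#define PADDLE_FLUID_API __attribute__((visibility("default")))
#else
#define PADDLE_FLUID_API
#endif

// Not exported: internal helpers stay out of the shared library's dynamic
// symbol table, which keeps the exported symbol set small and reduces clashes
// with users' own copies of gflags, glog, etc.
static int InternalHelper() { return 0; }

// Exported: intended to be part of the public inference API.
PADDLE_FLUID_API int GetFluidInferenceVersion() { return InternalHelper(); }
```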
