design doc for implementation parameters in CPP. #2249

Merged
41 changes: 41 additions & 0 deletions doc/design/parameters_in_cpp.md
@@ -0,0 +1,41 @@
# Design Doc: The C++ Class `Parameters`

`Parameters` is a concept we designed in the Paddle V2 API. `Parameters` is a container of parameters that lets Paddle share parameters between topologies. We described the usage of `Parameters` in [api.md](./api.md).

We previously implemented `Parameters` in Python when designing the V2 API. The current implementation has several defects:
* We just use `memcpy` to share `Parameters` between topologies, which is very inefficient.
* We did not implement sharing `Parameters` during training. We only trigger `memcpy` when training starts.

It is necessary to implement `Parameters` on the C++ side. However, this could require refactoring Paddle, because Paddle was previously designed to train only one topology, i.e., each `GradientMachine` contains its `Parameter` as a data member. In the current Paddle implementation, three concepts are associated with `Parameters`:

1. `paddle::Parameter`. `Parameters` is a container for `paddle::Parameter`.
It is evident that we should use `paddle::Parameter` when developing `Parameters`.
However, the `Parameter` class contains many functions and does not have a clear interface.
It covers `create/store Parameter`, `serialize/deserialize`, `optimize` (i.e., SGD), and `randomize/zero`.
When developing `Parameters`, we only use the `create/store Parameter` functionality.
We should split the functionalities of `Parameter` into several classes to clean up the Paddle C++ implementation.

2. `paddle::GradientMachine` and its subclasses, e.g., `paddle::MultiGradientMachine` and `paddle::NeuralNetwork`.
We should pass `Parameters` to `paddle::GradientMachine` during `forward/backward` to avoid `memcpy` between topologies.
We should also handle multi-GPU/CPU training, because `forward` and `backward` may run on multiple GPUs and multiple CPUs.
`Parameters` should dispatch the parameter values to each device and gather the parameter gradients from each device.

3. `paddle::ParameterUpdater`. The `ParameterUpdater` is used to update parameters in Paddle.
So `Parameters` should be used by `paddle::ParameterUpdater`, and `paddle::ParameterUpdater` should optimize `Parameters` (by SGD).
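
To make the container idea concrete, here is a minimal sketch of a `Parameters` class that stores parameters by name and hands out shared handles, so two topologies can refer to the same storage without `memcpy`. Everything in it is an assumption made for illustration: the `Parameter` struct is a stripped-down stand-in for `paddle::Parameter` (only the create/store role), and `getOrCreate`, `has`, and `size` are hypothetical names, not Paddle's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for paddle::Parameter: only the create/store role.
struct Parameter {
  std::string name;
  std::vector<float> value;
  std::vector<float> gradient;

  Parameter(std::string n, std::size_t size)
      : name(std::move(n)), value(size, 0.0f), gradient(size, 0.0f) {}
};

// Hypothetical container: owns parameters by name and returns shared handles.
class Parameters {
 public:
  // Returns the existing parameter with this name, or creates it if missing.
  std::shared_ptr<Parameter> getOrCreate(const std::string& name,
                                         std::size_t size) {
    auto it = params_.find(name);
    if (it != params_.end()) {
      // A shared parameter must have the same shape in every topology.
      assert(it->second->value.size() == size);
      return it->second;
    }
    auto param = std::make_shared<Parameter>(name, size);
    params_.emplace(name, param);
    return param;
  }

  bool has(const std::string& name) const { return params_.count(name) != 0; }

  std::size_t size() const { return params_.size(); }

 private:
  std::map<std::string, std::shared_ptr<Parameter>> params_;
};
```

Two topologies that both call `getOrCreate("fc.w", n)` on the same `Parameters` object receive handles to one buffer, so a write made through one handle is visible through the other.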


The step-by-step approach for implementing `Parameters` in the Paddle C++ core is listed below. Each step should be a PR and could be merged into Paddle one by one.

1. Clean up the `paddle::Parameter` interface. Extract the functionalities of `paddle::Parameter` to prepare for the implementation of `Parameters`.

2. Implement a `Parameters` class. It just stores the `paddle::Parameter` objects inside. Make `GradientMachine` use `Parameters` as a class member.

3. Make `Parameters` support Multi-CPU and Multi-GPU training to prepare for sharing `Parameter` between topologies.
Collaborator

Is it the case that GradientMachine and NeuralNetwork do single-threaded training, and MultiGradientMachine does concurrent training? If so, it seems that syncing up among threads is the responsibility of MultiGradientMachine rather than Parameters.

If GradientMachine, NeuralNetwork, and MultiGradientMachine all use Parameters, but only MultiGradientMachine does concurrent training and the first two classes do not, then multithreading support should live in MultiGradientMachine, not in Parameters.

Collaborator Author

MultiGradientMachine supports only one topology during training, but Parameters may be shared by many topologies. I think MultiGradientMachine should invoke a method such as Parameters.exchangeToMultiGPU(used_parameter_names) when it uses only a subset of Parameters. Alternatively, another class could do the exchange, for example: ParameterExchanger exchanger(parameters, used_parameter_names); exchanger.exchange();

Another reason I want to extract the parameter exchange/gather logic from MultiGradientMachine is that MultiGradientMachine is a very heavy class. It mixes multi-device computing logic, parameter exchange/gather logic, and synchronization in a single class. Extracting some of that logic would make it better and clearer.

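The reasons are as follows:
1. MultiGradientMachine only handles the case of training a single topology, whereas Parameters may be shared by several topologies during training. Multi-GPU parameter exchange therefore differs from the single-topology case: not all parameters need to be exchanged, only selected ones. Of course, MultiGradientMachine should still be the one that calls Parameters.exchange to do the exchange.
A possible implementation is:

    // ParameterExchanger owns all of the parameter-exchange logic
    ParameterExchanger exchanger(parameters, used_parameter_names);
    exchanger.exchange();

2. Another reason to extract the parameter-exchange logic is that MultiGradientMachine is a very heavy class that mixes several responsibilities, such as multi-device computation, parameter gathering and dispatching, and synchronization logic. If we factor out the parameter-gathering logic while writing Parameters, the code becomes clearer.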

Collaborator Author

Maybe a global function is better.

Collaborator Author

Done. Added this part to the design doc.

Because we need to share `Parameters` between topologies, it is the responsibility of `Parameters` to exchange parameters between GPUs.
`GradientMachine` should not handle how to exchange parameters, because a `GradientMachine` is only used to train one topology and we need to support training many topologies in Paddle, i.e., many `GradientMachine`s could use one `Parameters`.
* We should use a global function to exchange `Parameters` between GPUs, not a member function of `Parameters`. The `MultiGradientMachine` invokes this function and passes `Parameters` as its input.
* The `MultiGradientMachine` contains many functionalities. Extracting the parameter-exchange logic could make `MultiGradientMachine` clearer and simpler.
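
As a rough illustration of the global-function approach, the sketch below dispatches parameter values to per-device replicas and gathers gradients back, touching only a named subset of parameters. It assumes a simplified model that is not Paddle's real API: `Parameters` and `DeviceReplica` are plain maps held in host memory (no actual GPU transfer), and `dispatchValues` and `gatherGradients` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical buffers for one parameter.
struct ParamBuffers {
  std::vector<float> value;
  std::vector<float> gradient;
};

// Hypothetical stand-ins: the shared container and one replica per device.
// Devices are simulated in host memory; no real GPU copy happens here.
using Parameters = std::map<std::string, ParamBuffers>;
using DeviceReplica = std::map<std::string, ParamBuffers>;

// Free function: copy the values of the used parameters to every replica
// and reset the replica-side gradients to zero.
void dispatchValues(const Parameters& params,
                    const std::vector<std::string>& used_names,
                    std::vector<DeviceReplica>* replicas) {
  for (auto& replica : *replicas) {
    for (const auto& name : used_names) {
      const ParamBuffers& src = params.at(name);
      ParamBuffers& dst = replica[name];
      dst.value = src.value;
      dst.gradient.assign(src.value.size(), 0.0f);
    }
  }
}

// Free function: sum the gradients of the used parameters over all replicas
// into the shared container, overwriting its previous gradients.
void gatherGradients(const std::vector<DeviceReplica>& replicas,
                     const std::vector<std::string>& used_names,
                     Parameters* params) {
  for (const auto& name : used_names) {
    ParamBuffers& dst = params->at(name);
    dst.gradient.assign(dst.value.size(), 0.0f);
    for (const auto& replica : replicas) {
      const std::vector<float>& grad = replica.at(name).gradient;
      assert(grad.size() == dst.gradient.size());
      for (std::size_t i = 0; i < grad.size(); ++i) {
        dst.gradient[i] += grad[i];
      }
    }
  }
}
```

Because both functions take the list of used names, a machine that trains one topology can exchange only the parameters that topology uses, while the rest of the shared `Parameters` stays untouched.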

4. Make `Parameters` an argument of the `forward/backward` functions, not a data member of `GradientMachine`. For example, `forward` could be `forward(const Parameters& params, ...)` and `backward` could be `backward(Parameters* params, ...)`. After this step, Paddle could share `Parameters` between topologies.

5. `ParameterUpdater` is invoked by `GradientMachine` and `Trainer`, but it updates `Parameters`. At the end of this refactoring, we could change `ParameterUpdater` to use `Parameters` directly, which makes the implementation of `ParameterUpdater` clearer.
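
The last two steps can be sketched with a toy example: a machine whose `forward` reads a `const Parameters&`, whose `backward` writes gradients through a `Parameters*`, and an updater that operates on `Parameters` directly. This is only an illustration under assumptions: `ScaleMachine` (computing `y = w * x` for a single scalar weight), the map-based `Parameters`, and `sgdUpdate` are hypothetical and far simpler than the real `GradientMachine` and `ParameterUpdater`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical scalar parameter and container, for illustration only.
struct Param {
  float value = 0.0f;
  float gradient = 0.0f;
};
using Parameters = std::map<std::string, Param>;

// Hypothetical toy machine: y = w * x for one named weight.
// It holds no parameter storage; parameters arrive as arguments.
class ScaleMachine {
 public:
  explicit ScaleMachine(std::string weight_name)
      : weight_name_(std::move(weight_name)) {}

  // forward only reads the parameters.
  float forward(const Parameters& params, float x) const {
    return params.at(weight_name_).value * x;
  }

  // backward accumulates dL/dw = dL/dy * x into the shared gradient.
  void backward(Parameters* params, float x, float grad_y) const {
    params->at(weight_name_).gradient += grad_y * x;
  }

 private:
  std::string weight_name_;
};

// Hypothetical updater that works on Parameters directly:
// one plain SGD step, then the gradients are cleared.
void sgdUpdate(Parameters* params, float learning_rate) {
  for (auto& entry : *params) {
    entry.second.value -= learning_rate * entry.second.gradient;
    entry.second.gradient = 0.0f;
  }
}
```

Since neither machine owns the weight, two `ScaleMachine` instances constructed with the same weight name accumulate gradients into the same entry, and a single `sgdUpdate` call updates the value both of them read on the next `forward`.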