Commit 118d950

Merge pull request #8425 from tonyyang-svail/tonyyang-svail-patch-2
design doc for parallel_do.md
2 parents 9890bb5 + 8b24bd4 commit 118d950

doc/design/parallel_do.md

Lines changed: 162 additions & 0 deletions
# Design Doc: Parallel_Do in PaddlePaddle

In PaddlePaddle, we use the parallel_do primitive to represent multithreaded data-parallel processing.

## Design overview

The definition of a parallel_do op looks like the following:

```c++
AddInput(kInputs, "Inputs needed to be split onto different devices").AsDuplicable();
AddInput(kParameters, "Parameters are duplicated over different devices")
    .AsDuplicable();
AddInput(kPlaces, "Devices used for parallel processing");
AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
AddOutput(kParallelScopes,
          "Scopes for all local variables in forward pass. One scope for each device");
AddAttr<framework::BlockDesc *>(kParallelBlock,
                                "List of operators to be executed in parallel");
```
A vanilla implementation of parallel_do can be illustrated as follows (`|` means a single thread and
`||||` means multiple threads):

```
In the forward pass
  |      Split input onto different devices
  |      Copy parameters onto different devices
  ||||   Compute forward pass in parallel
  |      Merge output from different devices

In the backward pass
  |      Split output@grad onto different devices
  ||||   Compute backward pass in parallel
  |      Accumulate param@grad from different devices to the first device
  |      Merge input@grad from different devices
  |      Copy param@grad to the place of parallel_do_op
```
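For illustration only, here is a minimal Python-style sketch of the vanilla forward pass described above. It is not the real C++ operator; `split`, `copy_to`, `merge`, `Scope`, and `block.run` are hypothetical helpers standing in for the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_do_forward(inputs, params, places, block):
    # |    Split input onto different devices
    input_shards = split(inputs, len(places))
    # |    Copy parameters onto different devices, one scope per device
    scopes = [Scope() for _ in places]
    for place, scope in zip(places, scopes):
        copy_to(params, place, scope)
    # |||| Compute the forward pass in parallel, one thread per device
    with ThreadPoolExecutor(max_workers=len(places)) as pool:
        outputs = list(pool.map(
            lambda args: block.run(*args),
            zip(input_shards, scopes, places)))
    # |    Merge outputs from different devices
    return merge(outputs), scopes
```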
This implementation allows us to write a mixed-device program like this:

```python
# get embedding feature on CPU
feature = some_cpu_only_op(data)

gpu_places = get_place(use_gpu=True)
# parallel processing on multiple GPUs
pd = ParallelDo(gpu_places)
with pd.do():
    read_input(feature)
    prediction = my_net(feature)
    write_output(prediction)
prediction = pd()
loss = cross_entropy(prediction, label)
```
And the ProgramDesc looks like the following:

```
# start_program will be run by executor(CPUPlace), all w1, w2 will be allocated on CPU
start_program
{
  vars: w1, w2
  ops: init(w1), init(w2)
}

main_program
{
  block0 {
    vars: data, places, w1, w2
    ops: data, get_place, parallel_do(block1),
         parallel_do_grad(block2),
         sgd(w2, w2_grad),
         sgd(w1, w1_grad)
  }
  block1 {
    parent_block: 0
    vars: data, h1, h2, loss
    ops: fc, fc, softmax
  }
  block2 {
    parent_block: 1
    vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
    ops: softmax_grad,
         fc_grad,
         fc_grad
  }
}
```
## Performance Improvement

There are several places where we can make this parallel_do faster.

### forward: split input onto different devices

If the input of the parallel_do is independent of any prior operators, we can avoid this step by
prefetching the input onto different devices in a separate background thread. The Python code
looks like this:
```python
pd = ParallelDo(gpu_places)
with pd.do():
    feature = get_data_from_prefetch_queue(gpu_places)
    prediction = my_net(feature)
    write_output(prediction)
```
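As a rough sketch of how such prefetching could work (the bounded queue and the helpers `read_next_batch` and `copy_to_device` below are hypothetical; the queue accessor is shown as a per-device variant only to illustrate hiding the split/copy cost behind a background thread):

```python
import queue
import threading

# one small bounded queue of device-resident batches per place
prefetch_queues = {place: queue.Queue(maxsize=2) for place in gpu_places}

def prefetch_worker(place, q):
    # runs in the background: read a batch and copy it onto the device
    for batch in read_next_batch():
        q.put(copy_to_device(batch, place))

for place in gpu_places:
    threading.Thread(target=prefetch_worker,
                     args=(place, prefetch_queues[place]),
                     daemon=True).start()

def get_data_from_prefetch_queue(place):
    # each device thread pops a batch that is already on its device
    return prefetch_queues[place].get()
```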
### forward: Copy parameters onto different devices

We can avoid this step by making each device hold its own copy of the parameters. This requires:

1. `fluid.default_startup_program()` to be run on all devices
1. In the backward pass, allreduce param@grad across different devices; this requires
   1. `backward.py` to add `allreduce` operators at parallel_do_grad (see the sketch after this list)
   1. `allreduce` operators to be called in async mode to achieve maximum throughput
1. Apply gradient-related ops (i.e. clipping, normalization, decay, sgd) on different devices in parallel
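The following is a speculative sketch of how `backward.py` could insert these `allreduce` operators into the parallel_do_grad block; `grad_block`, `make_op`, and `param_grad_names` are assumed names used only for illustration, not the real fluid API.

```python
def append_allreduce_ops(grad_block, places, scopes, param_grad_names):
    # walk the backward block and place an allreduce right after each op
    # that produces a parameter gradient, so communication can start early
    new_ops = []
    for op in grad_block.ops:
        new_ops.append(op)
        for grad_name in op.outputs:
            if grad_name in param_grad_names:
                new_ops.append(make_op(
                    "allreduce",
                    inputs={"places": places, "scopes": scopes, "x": grad_name},
                    attrs={"async": True}))  # async mode to overlap with compute
    grad_block.ops = new_ops
```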
By doing so, we also avoid the "backward: accumulate param@grad from different devices to the first device" step.
And the ProgramDesc looks like the following:
```
# w1, w2 will be allocated on all GPUs
start_program
{
  block0 {
    parallel_do(block1)
  }
  block1 {
    parent_block: 0
    vars: w1, w2
    ops: init(w1), init(w2)
  }
}

main_program
{
  block0 {
    vars: data, places, w1, w2
    ops: data, get_place, parallel_do(block1),
         parallel_do_grad(block2),    # append_backward
         parallel_do(block3)          # append_optimization
  }
  block1 {
    parent_block: 0
    vars: data, h1, h2, loss
    ops: fc, fc, softmax
  }
  block2 {
    parent_block: 1
    vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
    ops: softmax_grad,
         fc_grad, allreduce(places, scopes, w1_grad),
         fc_grad, allreduce(places, scopes, w2_grad)
  }
  block3 {
    parent_block: 0
    vars: lr
    ops: sgd(w2, w2_grad),
         sgd(w1, w1_grad)
  }
}
```
