[Kunlun]PR3: add xpu executor, multi xpu card train function optimization #30317

vslyu · 2021-01-11T13:34:47Z

PR types

Function optimization

PR changes

Others

Describe

Add an XPU executor and optimize multi-card XPU training.

cmake build flags:

cmake -DWITH_XPU_BKCL=ON -DCMAKE_INSTALL_PREFIX=./output/ -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DWITH_MKL=OFF -DWITH_XPU=ON -DWITH_GPU=OFF -DWITH_FLUID_ONLY=ON -DPY_VERSION=3.7  -DWITH_TESTING=ON -DWITH_STYLE_CHECK=ON  ../

examples:

export FLAGS_selected_xpus=0,1

import numpy
import os
import paddle
import paddle.fluid as fluid
paddle.enable_static()

use_xpu = True
place = fluid.XPUPlace(0) if use_xpu else fluid.CPUPlace()
places = fluid.xpu_places()

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

train_program = fluid.CompiledProgram(train_program).with_data_parallel(
            loss_name=loss.name, places=places)
exe = fluid.Executor(place)
exe.run(startup_program)

train_data = numpy.random.random(size=(10, 1)).astype('float32')
loss_data, = exe.run(train_program, feed={"X": train_data}, fetch_list=[loss.name])

QingshuChen · 2021-01-14T03:00:56Z

cmake/external/xpu.cmake

This can be rebased once this PR is merged: https://github.com/PaddlePaddle/Paddle/pull/30381/files. The most recent API version is currently 0113.

rebase done.

QingshuChen · 2021-01-14T03:02:14Z

paddle/fluid/framework/details/parallel_ssa_graph_executor.h

I chose the name xpu_threaded_ssa_graph_executor by analogy with fast_threaded_ssa_graph_executor; please consider whether it should be renamed.

How about bind_threaded_ssa_graph_executor?

QingshuChen · 2021-01-14T03:03:49Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

Why does this report that the XPU is out of memory? The memory here is allocated on the CPU.

Delete stream_op_count_ because we don't use multi stream.

zhiqiu · 2021-01-14T04:41:21Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

2018 -> 2021

zhiqiu · 2021-01-14T04:44:36Z

paddle/fluid/framework/parallel_executor.cc

Is log level == 1 necessary?

zhiqiu · 2021-01-14T04:49:24Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

What is stream_op_count_ used for?

It counts the number of ops per stream. In this PR, each Kunlun XPU device supports only one compute stream; multiple compute streams per device will be supported later.

It is unused, so delete it for now; add it back later when multi-stream support is needed.

zhiqiu · 2021-01-14T04:50:56Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

What is thread pool(1) used for?

It is used to run compute ops. It is set to 1 because the XPU device_context is currently not thread-safe.

wangxicoding · 2021-01-14T08:58:23Z

paddle/fluid/framework/CMakeLists.txt

Replace xpu with bind.

wangxicoding · 2021-01-14T09:07:36Z

paddle/fluid/framework/details/parallel_ssa_graph_executor.cc

Why isn't this placed in the for loop above?

Deleted it, and changed the code to use the original executors_.

wangxicoding · 2021-01-14T09:08:10Z

paddle/fluid/framework/details/parallel_ssa_graph_executor.cc

Why does xpu_executors_ not reuse the original executors_?

wangxicoding · 2021-01-14T09:08:46Z

paddle/fluid/framework/details/parallel_ssa_graph_executor.cc

If everything uses executors_, this block of code can be removed.

wangxicoding · 2021-01-14T09:09:45Z

paddle/fluid/framework/details/parallel_ssa_graph_executor.h

Unify everything under executors_. The element type can be the base class SSAGraphExecutor.

wangxicoding · 2021-01-14T09:46:21Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

It is unused, so delete it for now; add it back later when multi-stream support is needed.

wangxicoding · 2021-01-14T09:47:18Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

It is used to run compute ops. It is set to 1 because the XPU device_context is currently not thread-safe.

wangxicoding · 2021-01-14T09:53:38Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

This is not very elegant; you could wait on the thread instead.

wangxicoding · 2021-01-14T10:03:52Z

paddle/fluid/framework/parallel_executor.cc

Move this define inside the if.

wangxicoding · 2021-01-14T10:04:48Z

paddle/fluid/framework/details/xpu_threaded_ssa_graph_executor.cc

The waits previously added in op_handle_base can now be removed.

chenwhql

There is a lot of duplicated code. Could this inherit from FastThreadedSSAGraphExecutor and override only some of the functions?

chenwhql · 2021-01-15T06:13:41Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.h

@@ -0,0 +1,110 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.


2018 -> 2021?

wanghuancoder

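I have not yet reviewed the two functions RunMultiDeviceOpAsync and RunOpAsyncMainStream. If convenient, please add comments describing the design so the code is easier to follow; after all, the Executor is quite complex. There should also be a comment at the top stating that this Executor is intended for XPU. Otherwise, years from now, newcomers studying the code will not know what this Executor is for.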

wanghuancoder · 2021-01-15T07:04:49Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.cc

+    auto *op = new FetchOpHandle(fetch_node, fetches, i, &local_scopes_,
+                                 &local_exec_scopes_, true);


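Why is return_merged hard-coded to true here? This parameter is passed all the way from the executor's public API down to the underlying Executor, and both values (true and false) are supported. What is the reason for dropping the parameter and hard-coding true?

I have added this interface back.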

wanghuancoder · 2021-01-15T07:16:31Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.cc

+    auto cur_op = ready_ops->Pop();
+    if (cur_op == nullptr) {
+      // sleep a while to make sure worker thread quit
+      sleep(10);


Does this really wait 10 seconds??

We are rushing to make the 2.0 release; this will be revised after this PR is merged.

wanghuancoder · 2021-01-15T07:17:18Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.cc

+  while (exec_op_count_ < op_deps_.size()) {
+  }


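Written this way, the CPU goes straight to 100%. There is also the sleep issue above. Please consider handling both with a single wait mechanism.

We are rushing to make the 2.0 release; this will be revised after this PR is merged.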
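
For readers following this thread, the kind of wait mechanism requested above can be sketched as follows. This is only an illustration in Python of the general condition-variable pattern, not the executor's C++ code and not the change that was later made. The names OpCounter, mark_done, wait_all and run_ops are hypothetical.

```python
import threading


class OpCounter:
    """Hypothetical helper: counts finished ops and blocks a waiter until all are done."""

    def __init__(self, total_ops):
        self._total = total_ops
        self._done = 0
        self._cond = threading.Condition()

    def mark_done(self):
        # Called by a worker thread after it finishes one op.
        with self._cond:
            self._done += 1
            if self._done >= self._total:
                self._cond.notify_all()

    def wait_all(self, timeout=None):
        # Sleeps on the condition instead of spinning in an empty while loop.
        # Returns True once every op has finished, False if the timeout expires.
        with self._cond:
            return self._cond.wait_for(lambda: self._done >= self._total, timeout)


def run_ops(num_ops):
    # Each thread stands in for a worker that completes exactly one op.
    counter = OpCounter(num_ops)
    threads = [threading.Thread(target=counter.mark_done) for _ in range(num_ops)]
    for t in threads:
        t.start()
    finished = counter.wait_all(timeout=5.0)
    for t in threads:
        t.join()
    return finished
```

The waiter consumes no CPU while blocked, and the same notification can also replace a fixed sleep used to let worker threads quit.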

wanghuancoder · 2021-01-15T07:25:56Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.cc

+
+  platform::XPUPlace cur_place;
+  std::size_t cur_count = 0;
+  while (cur_count < op_deps_.size()) {


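Your cur_count is the number of ops executed from ready_ops, but ready_ops also contains the ready_fetch_ops ops. Suppose op_deps_ contains 100 ops and ready_fetch_ops also contains 100. The while-exit condition would be met almost immediately, yet the ops actually in op_deps_ would not have finished executing.

Besides ready_fetch_ops, ready_ops also contains bootstrap_ops_, so the while-exit condition cannot be reached that way.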

wanghuancoder · 2021-01-15T07:27:50Z

paddle/fluid/framework/details/bind_threaded_ssa_graph_executor.cc

+  for (uint32_t i = 0; i < places.size(); i++) {
+    pool_.emplace_back(std::unique_ptr<::ThreadPool>(new ::ThreadPool(1)));
+  }


Could you explain the reasoning behind N thread pools with one thread each?

Each thread is bound to one XPU device, because the XPU device_context is currently not thread-safe.
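
As an illustration of the one-thread-per-device layout described above, here is a minimal Python sketch using single-worker thread pools. It is not the executor's C++ implementation. The function run_on_devices and the plain integer device ids are hypothetical stand-ins for the real places and op handles.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_on_devices(ops_per_device):
    """Run each device's ops on a dedicated single-thread pool.

    ops_per_device maps a (hypothetical) device id to a list of zero-argument
    callables. Because each pool has exactly one worker, ops for the same
    device never run concurrently, so a non-thread-safe per-device context
    would only ever be touched from one thread.
    """
    pools = {dev: ThreadPoolExecutor(max_workers=1) for dev in ops_per_device}
    try:
        futures = {
            dev: [pools[dev].submit(op) for op in ops]
            for dev, ops in ops_per_device.items()
        }
        # Collect results per device in submission order.
        return {dev: [f.result() for f in fs] for dev, fs in futures.items()}
    finally:
        for pool in pools.values():
            pool.shutdown(wait=True)
```

Passing threading.get_ident as the op for two devices shows that every op of one device reports the same thread id, while the two devices report different ids.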

wangxicoding

LGTM

vslyu force-pushed the dev/xpu_pe3 branch from e73ce27 to e1b4b5b · January 11, 2021 13:40

QingshuChen reviewed Jan 14, 2021

zhiqiu reviewed Jan 14, 2021

wangxicoding requested changes Jan 14, 2021

add xpu executor, fix multi xpu train,test=kunlun

7fffeb3

vslyu force-pushed the dev/xpu_pe3 branch from b55fcd2 to 7fffeb3 · January 14, 2021 14:51

chenwhql reviewed Jan 15, 2021

wanghuancoder reviewed Jan 15, 2021

fix windows CI, rm semaphore, add return_merged for BindThreadedSSAGraphExecutor, test=windows_ci

23a4953

wangxicoding approved these changes Jan 18, 2021

wangxicoding merged commit 843dc3c into PaddlePaddle:develop Jan 18, 2021

vslyu added a commit to vslyu/Paddle that referenced this pull request Jan 18, 2021

[Kunlun]PR3: add xpu executor, multi xpu card train function optimization (PaddlePaddle#30317)

d542cd1

vslyu mentioned this pull request Jan 18, 2021

[Kunlun]2.0 cherry-pick: add xpu executor, multi xpu card train function optimization (#30317) #30535

Merged

fuyinno4 pushed a commit that referenced this pull request Jan 19, 2021

[Kunlun]PR3: add xpu executor, multi xpu card train function optimization (#30317) (#30535)

420fdbb
