Skip to content

Conversation

@vslyu
Copy link
Contributor

@vslyu vslyu commented Jan 11, 2021

PR types

Function optimization

PR changes

Others

Describe

add xpu executor, multi xpu card train function optimization

cmake build flags:

cmake -DWITH_XPU_BKCL=ON -DCMAKE_INSTALL_PREFIX=./output/ -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DWITH_MKL=OFF -DWITH_XPU=ON -DWITH_GPU=OFF -DWITH_FLUID_ONLY=ON -DPY_VERSION=3.7  -DWITH_TESTING=ON -DWITH_STYLE_CHECK=ON  ../

examples:

export FLAGS_selected_xpus=0,1
import numpy
import os
import paddle
import paddle.fluid as fluid
paddle.enable_static()

use_xpu = True
place = fluid.XPUPlace(0) if use_xpu else fluid.CPUPlace()
places = fluid.xpu_places()

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

train_program = fluid.CompiledProgram(train_program).with_data_parallel(
            loss_name=loss.name, places=places)
exe = fluid.Executor(place)
exe.run(startup_program)

train_data = numpy.random.random(size=(10, 1)).astype('float32')
loss_data, = exe.run(train_program, feed={"X": train_data}, fetch_list= [loss.name])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个可以等这个PR合入后rebase一下,https://github.com/PaddlePaddle/Paddle/pull/30381/files, 目前比较新的api是0113

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xpu_threaded_ssa_graph_executor这个名字是我根据fast_threaded_ssa_graph_executor取的,看下是否要改个名字。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about bind_threaded_ssa_graph_executor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这边为什么报XPU没有内存,这边内存是分配在cpu上的。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete stream_op_count_ because we don't use multi stream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018 -> 2021

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is log level == 1 neccessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is stream_op_count_ used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For counting op numbers per streaming. In this PR, every kunlun xpu device only support one stream for computing, multi stream for computing per device supported later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没用先删掉,之后需要支持多stream再加

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is thread pool(1) used for?

Copy link
Contributor

@wangxicoding wangxicoding Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个用于计算op,设置成1因为目前xpu的device_context不是线程安全的

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xpu换成bind吧

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为啥不是放到上面的那个for训练里

Copy link
Contributor Author

@vslyu vslyu Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete it , and change to use origin executors_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xpu_executors_为什么不是用原来的executors_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix, done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

都用executors_可以把这段代码去掉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

都统一成executors_吧。里面的Executor可以用基类SSAGraphExecutor,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没用先删掉,之后需要支持多stream再加

Copy link
Contributor

@wangxicoding wangxicoding Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个用于计算op,设置成1因为目前xpu的device_context不是线程安全的

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个地方不太优雅,可以wait一下thread。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个define放到if里面吧

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以把之前op_handle_base里面加的那些wait干掉了

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

@chenwhql chenwhql left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好多重复代码,这个能继承FastThreadedSSAGraphExecutor,override部分function实现吗

@@ -0,0 +1,110 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018 -> 2021?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

@wanghuancoder wanghuancoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RunMultiDeviceOpAsync、RunOpAsyncMainStream这两个函数暂时没有看。如果方便可以加一些注释介绍设计思路,方面能够看懂代码。比较Executor是很复杂的。另外也应该在最上面加一些注释,介绍这个Executor适用于XPU的。否则多年以后,新人学习代码,都不知道这个Executor是干嘛用的。

Comment on lines 208 to 209
auto *op = new FetchOpHandle(fetch_node, fetches, i, &local_scopes_,
&local_exec_scopes_, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里return_merged为啥直接设置成了true?从executor对外的api到底层的Executor,都有这个参数的传递。都支持return_merged(true、false)。这里删掉这个参数,直接设置成true是出于什么考虑?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

把这个接口加回来了

auto cur_op = ready_ops->Pop();
if (cur_op == nullptr) {
// sleep a while to make sure worker thread quit
sleep(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里要等10秒??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

赶2.0发版,待此PR合入后再行修改

Comment on lines +158 to +159
while (exec_op_count_ < op_deps_.size()) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种写法,CPU直接100%了。还有上面的sleep问题。统一考虑加一个wait机制吧。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

赶2.0发版,待此PR合入后再行修改


platform::XPUPlace cur_place;
std::size_t cur_count = 0;
while (cur_count < op_deps_.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

你的cur_count代表的是执行ready_ops力的op的个数。而ready_ops里是有ready_fetch_ops op的。假如op_deps_的op个数是100个。而ready_fetch_ops的个数是100个。瞬间就能达到跳出while的条件,但是真正的op_deps_里的op并没有执行完。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready_ops除了有ready_fetch_ops还有bootstrap_ops_,达不到跳出while的条件。

Comment on lines +47 to +49
for (uint32_t i = 0; i < places.size(); i++) {
pool_.emplace_back(std::unique_ptr<::ThreadPool>(new ::ThreadPool(1)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

能否解释一下,每个线程池里1根线程,N个线程池,是出于什么考虑?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

一个线程绑定一个XPU设备,因为目前xpu的device_context不是线程安全的。

Copy link
Contributor

@wangxicoding wangxicoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants