-
Couldn't load subscription status.
- Fork 5.9k
[Kunlun]PR3: add xpu executor, multi xpu card train function optimization #30317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cmake/external/xpu.cmake
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个可以等这个PR合入后rebase一下,https://github.com/PaddlePaddle/Paddle/pull/30381/files, 目前比较新的api是0113
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebase done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xpu_threaded_ssa_graph_executor这个名字是我根据fast_threaded_ssa_graph_executor取的,看下是否要改个名字。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about bind_threaded_ssa_graph_executor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这边为什么报XPU没有内存,这边内存是分配在cpu上的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete stream_op_count_ because we don't use multi stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2018 -> 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is log level == 1 neccessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is stream_op_count_ used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For counting op numbers per streaming. In this PR, every kunlun xpu device only support one stream for computing, multi stream for computing per device supported later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没用先删掉,之后需要支持多stream再加
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is thread pool(1) used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个用于计算op,设置成1因为目前xpu的device_context不是线程安全的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xpu换成bind吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为啥不是放到上面的那个for训练里
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete it , and change to use origin executors_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xpu_executors_为什么不是用原来的executors_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
都用executors_可以把这段代码去掉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
都统一成executors_吧。里面的Executor可以用基类SSAGraphExecutor,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没用先删掉,之后需要支持多stream再加
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个用于计算op,设置成1因为目前xpu的device_context不是线程安全的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个地方不太优雅,可以wait一下thread。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个define放到if里面吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以把之前op_handle_base里面加的那些wait干掉了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好多重复代码,这个能继承FastThreadedSSAGraphExecutor,override部分function实现吗
| @@ -0,0 +1,110 @@ | |||
| // Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2018 -> 2021?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RunMultiDeviceOpAsync、RunOpAsyncMainStream这两个函数暂时没有看。如果方便可以加一些注释介绍设计思路,方面能够看懂代码。比较Executor是很复杂的。另外也应该在最上面加一些注释,介绍这个Executor适用于XPU的。否则多年以后,新人学习代码,都不知道这个Executor是干嘛用的。
| auto *op = new FetchOpHandle(fetch_node, fetches, i, &local_scopes_, | ||
| &local_exec_scopes_, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里return_merged为啥直接设置成了true?从executor对外的api到底层的Executor,都有这个参数的传递。都支持return_merged(true、false)。这里删掉这个参数,直接设置成true是出于什么考虑?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
把这个接口加回来了
| auto cur_op = ready_ops->Pop(); | ||
| if (cur_op == nullptr) { | ||
| // sleep a while to make sure worker thread quit | ||
| sleep(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里要等10秒??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
赶2.0发版,待此PR合入后再行修改
| while (exec_op_count_ < op_deps_.size()) { | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这种写法,CPU直接100%了。还有上面的sleep问题。统一考虑加一个wait机制吧。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
赶2.0发版,待此PR合入后再行修改
|
|
||
| platform::XPUPlace cur_place; | ||
| std::size_t cur_count = 0; | ||
| while (cur_count < op_deps_.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
你的cur_count代表的是执行ready_ops力的op的个数。而ready_ops里是有ready_fetch_ops op的。假如op_deps_的op个数是100个。而ready_fetch_ops的个数是100个。瞬间就能达到跳出while的条件,但是真正的op_deps_里的op并没有执行完。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready_ops除了有ready_fetch_ops还有bootstrap_ops_,达不到跳出while的条件。
| for (uint32_t i = 0; i < places.size(); i++) { | ||
| pool_.emplace_back(std::unique_ptr<::ThreadPool>(new ::ThreadPool(1))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
能否解释一下,每个线程池里1根线程,N个线程池,是出于什么考虑?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
一个线程绑定一个XPU设备,因为目前xpu的device_context不是线程安全的。
…aphExecutor, test=windows_ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR types
Function optimization
PR changes
Others
Describe
add xpu executor, multi xpu card train function optimization
cmake build flags:
examples: