Utilize StreamSafeCUDAAllocator to support fast GC in new executor #37642

Conversation


@From00 From00 commented Nov 28, 2021

PR types

New features

PR changes

Others

Describe

We implemented a StreamSafeCUDAAllocator in #37290, and in this PR we do the following to support fast GC in the new executor:

  1. Add a memory::InSameStream function, used in the multi-stream Tensor::mutable_data function to avoid repeatedly allocating memory on the same stream
  2. Implement a RecordStreamForGC function in InterpreterCore, which synchronizes the vars of a cross-stream OP, i.e., calls RecordStream before deleting a CUDA tensor
  3. Apply fast GC using the above functions when FLAGS_fast_eager_deletion_mode and FLAGS_use_stream_safe_cuda_allocator are both set to true

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.


@wanghuancoder wanghuancoder left a comment


An H2D scenario was discussed offline; logic needs to be added to support it.


@Aurelius84 Aurelius84 left a comment


Great work~~

@@ -693,6 +693,29 @@ const std::vector<size_t>& Instruction::GCCheckVars() const {
return gc_check_var_list_;
}

void Instruction::InferNeedStreamSyncVars() {
auto ParseIndexFromMap = [](
std::map<std::string, std::vector<int>>& index_map,


Suggested change
std::map<std::string, std::vector<int>>& index_map,
const std::map<std::string, std::vector<int>>& index_map,

Contributor Author


Thanks. This code has been deleted; please check the new commit.


From00 commented Nov 30, 2021

An H2D scenario was discussed offline; logic needs to be added to support it.

The code has been modified. The new design is as follows:

  1. The GPU memory an OP allocates lives on the same stream the OP runs on. In the new executor, H2D and D2H transfers currently allocate memory on their own dedicated streams, while other compute OPs allocate on the main stream by default.
  2. After an OP finishes and before CheckGC runs, every variable in gc_check_var_list_ is recorded. For a same-stream record, the allocator detects the case and returns immediately, so the overhead is small.

Aurelius84
Aurelius84 previously approved these changes Nov 30, 2021

@Aurelius84 Aurelius84 left a comment


LGTM

@@ -235,5 +237,13 @@ void Tensor::ResetHolderWithType(std::shared_ptr<memory::Allocation> holder,

void Tensor::set_type(const proto::VarType::Type& type) { type_ = type; }

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)

@Shixiaowei02 Shixiaowei02 Dec 22, 2021


In core data structures shared across devices, using macros to control the type or number of class/struct members is forbidden in principle. Where truly necessary, it must be discussed before merging. For the ones already added in a previous commit, we are discussing how to remove them.

Contributor Author


Macros used in the Tensor/LoDTensor declaration will severely block work progress on Pten; please modify.

Done, thanks.

@@ -290,6 +290,10 @@ class Tensor {
return *inplace_version_counter_;
}

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)


Same as above; this needs to be modified.

Contributor Author


The code has been modified; the macro guards have been removed from the Tensor class.

@@ -151,11 +151,12 @@ class AllocatorFacadePrivate {
}

@Shixiaowei02 Shixiaowei02 Dec 22, 2021


#37978
If possible, please replace shared_ptr&lt;Allocation&gt; with unique_ptr&lt;Allocation&gt; in this PR.

Contributor Author


Sure. I will open a follow-up PR to replace the shared_ptr in the base_ptr_test.cu unit test with unique_ptr.

@@ -235,5 +237,13 @@ void Tensor::ResetHolderWithType(std::shared_ptr<memory::Allocation> holder,

void Tensor::set_type(const proto::VarType::Type& type) { type_ = type; }

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
void Tensor::RecordStream(const gpuStream_t& stream) {

@jim19930609 jim19930609 Dec 22, 2021


Adding backend/device-specific interfaces to backend-agnostic data structures such as Tensor is strongly discouraged, and will block work related to Pten and the Dygraph Refactor.

Taking Tensor::RecordStream as an example: when adding this interface to Tensor, the principle is that the interface has to exist unconditionally (regardless of the macro, the backend, or the way you compile), but you may use macros inside a source file to control how the interface is implemented for different backends.

Contributor Author


The macros have been deleted; please check the new commit.


@jim19930609 jim19930609 left a comment


Macros used in the Tensor/LoDTensor declaration will severely block work progress on Pten; please modify.

jim19930609
jim19930609 previously approved these changes Dec 27, 2021

@jim19930609 jim19930609 left a comment


LGTM

… utilize-stream-safe-cuda-allocator-in-new-executor-gc

@zhiqiu zhiqiu left a comment


LGTM

@zhiqiu zhiqiu merged commit 0c7153a into PaddlePaddle:develop Dec 28, 2021
@From00 From00 deleted the utilize-stream-safe-cuda-allocator-in-new-executor-gc branch December 30, 2021 06:14