The CUDA Async Allocator #65092

Conversation
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
@eee4017: `liballocator.lib(cuda_malloc_async_allocator.cc.obj) : error LNK2019: unresolved external symbol "double paddle_flags::FLAGS_cuda_malloc_async_pool_memory_throttle_ratio" (?FLAGS_cuda_malloc_async_pool_memory_throttle_ratio@paddle_flags@@3NA) referenced in function "public: __cdecl paddle::memory::allocation::CUDAMallocAsyncAllocator::CUDAMallocAsyncAllocator(class std::shared_ptr,class phi::GPUPlace const &,struct CUstream_st *)"`
It was resolved.
You must have one RD (phlrain, luotao1, or Aurelius84) approval for changing the FLAGS, which manage the environment variables.
paddle/common/flags.cc (outdated):

```cpp
/*
 * CUDAMallocAsyncAllocator related FLAG
 * Name: FLAGS_cuda_malloc_async_pool_memory_throttle_ratio
 * Since Version: 2.7
```

2.7 -> 3.0

fixed
This reverts commit 8b808f1.
* Async Pool and Memory Throttling
* fix rocm build
* fix flag
* fix rocm build
* fix flag
PR Category
Others
PR Types
New features
Description
This pull request (PR) refactors the CUDA Asynchronous Allocator, introducing a new design for handling stream semantics within the allocator. The CUDA Asynchronous Allocator serves as an alternative to the stream-safe allocator, offloading all stream-ordered memory management tasks to CUDA. The CUDAMallocAsyncAllocator can be activated by setting the flag `FLAGS_use_cuda_malloc_async_allocator=1`.

Why We Need the CUDA Async Allocator
(1) Reducing Memory Footprint with Other Python Libraries
In machine learning, we often use various Python libraries to process data. Many of these libraries, such as CuPy and OpenCV, have their own memory pools. When multiple programs or libraries share a GPU, they compete for memory, and memory allocated by one library cannot easily be released to another. The CUDA Async Allocator lets these libraries share a unified pool, which is crucial for smoother integration with the Python GPU-computing ecosystem.
(2) Reducing Memory Footprint with CUDA Graphs
In PaddlePaddle, a separate memory pool is opened for each CUDA Graph, potentially leading to memory waste. When there are many graphs or frequent memory reuse between graphs, out-of-memory (OOM) errors can occur. This issue is demonstrated in PR #60516, where inefficient memory usage in the PP or VP training of GPT-3 with CUDA Graph enabled caused OOM errors on four H100-80GB GPUs with the default memory pool. Switching to the CUDA Async Allocator reduced memory usage from 95% to 25%.
(3) Decreasing the Memory Management Burden on the Framework
The CUDA stream-ordered memory management API is more compatible with stream semantics, allowing allocation and freeing on a specific stream. This reduces the workload on the framework. In the stream-safe allocator, `cudaEventQuery` is called frequently to determine whether a block can be released. With the Async Allocator, this process is simplified and offloaded to CUDA. Additionally, `cudaMallocAsync` and `cudaFreeAsync` are much faster than `cudaMalloc` and `cudaFree`, leading to performance improvements in workloads where the stream-safe allocator would call `cudaMalloc` frequently.
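For reference, here is a minimal standalone sketch of the stream-ordered runtime API that the allocator delegates to (plain CUDA runtime calls, available since CUDA 11.2; this is not Paddle code):

```cpp
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  void* ptr = nullptr;
  // The allocation is ordered on `stream`: it becomes usable once prior
  // work on the stream has finished, with no device-wide synchronization.
  cudaMallocAsync(&ptr, 1 << 20, stream);

  // ... launch kernels that use `ptr` on `stream` ...

  // The free is also stream-ordered: CUDA guarantees the memory is not
  // recycled until all preceding work on `stream` has completed.
  cudaFreeAsync(ptr, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```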
(4) Stricter and Safer Memory Management
By fully offloading memory management to CUDA, the CUDA Async Allocator provides stricter memory management. In the stream-safe allocator, blocks are cached for reuse, so memory may not be freed immediately after a block is released; because the memory is still cached, use-after-free bugs may not trigger CUDA errors and can go unnoticed. Additionally, the stream-safe allocator may allocate extra memory to align with specific sizes, masking out-of-bound errors. The CUDA Async Allocator helps detect memory leaks and ensures safer GPU memory usage. Several bugs were detected with the CUDA Async Allocator, resulting in the following fix PRs:
- `test_from_blob` #65023
- `all_to_all` #65093

The Design of the CUDA Async Allocator
This document outlines the design and implementation of the CUDA Asynchronous Allocator. The design addresses how to handle stream semantics in memory management, improving efficiency and reducing memory overhead compared to the traditional streamsafe allocator.
(1) Stream Semantics of CUDA Memory Management
Figure 1. Stream Safe Allocator
Figure 2. CUDA Async Allocator
When a block is allocated on a stream, it might be used by other streams. Therefore, a mechanism is required to ensure that the block is freed after it is used on the specific stream.
In the stream-safe allocator (Figure 1):

- A `cudaEvent` is created when the block is used on a stream.
- `cudaEventQuery` is used to determine whether the block can be released.
- `ProcessUnfreedBlock` is heuristically triggered when `malloc` is called, requiring continuous tracking to see whether each block can be freed.

In the Async Allocator (Figure 2):

- An event dependency (`cudaEventRecord`/`cudaStreamWaitEvent`) ensures the block is freed only after it has been used.
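As an illustration of that event dependency, a hedged sketch (the helper name is hypothetical; the real allocator manages these events internally):

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: `alloc_stream` owns the block and `use_stream` last
// touched it. A one-shot event makes the stream-ordered free wait for the
// last use, with no cudaEventQuery polling.
void FreeAfterUse(void* ptr, cudaStream_t alloc_stream,
                  cudaStream_t use_stream) {
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  cudaEventRecord(event, use_stream);           // mark the last use
  cudaStreamWaitEvent(alloc_stream, event, 0);  // the free waits on it
  cudaFreeAsync(ptr, alloc_stream);             // stream-ordered free
  cudaEventDestroy(event);                      // destruction is deferred
}
```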
(2) Throttling Mechanism
When memory is under pressure (nearing OOM), the free operations may not be fast enough. Therefore, the allocation stream needs to be throttled (as indicated by the red arrow in Figure 2). When memory utilization exceeds the `memory_throttle_ratio`, a stream synchronization is initiated before `malloc`. During the synchronization, all memory deallocation requests in the free queue are processed, reducing memory utilization before any new allocation proceeds. Currently, the ratio is heuristically set to 80%, but it can be adjusted using `FLAGS_cuda_malloc_async_pool_memory_throttle_ratio`.
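A minimal sketch of such a throttle check (the function name and structure are illustrative assumptions, not Paddle's actual implementation); the trade-offs of the ratio are listed after the sketch:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative throttled allocation: when device memory utilization exceeds
// the ratio, synchronize the stream first so that all queued cudaFreeAsync
// requests are processed before more memory is allocated.
void* ThrottledMallocAsync(std::size_t size, cudaStream_t stream,
                           double memory_throttle_ratio) {
  std::size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  const double utilization =
      static_cast<double>(total_bytes - free_bytes) /
      static_cast<double>(total_bytes);

  if (utilization > memory_throttle_ratio) {
    cudaStreamSynchronize(stream);  // drain the free queue under pressure
  }

  void* ptr = nullptr;
  cudaMallocAsync(&ptr, size, stream);
  return ptr;
}
```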
- Lower `memory_throttle_ratio` values: trigger synchronization more frequently, improving memory utilization but possibly decreasing performance due to the increased number of synchronization operations.
- Higher `memory_throttle_ratio` values: allow more memory to be allocated before triggering synchronization, enhancing performance by reducing sync operations but increasing the risk of OOM conditions.

(3) Cooperating with CUDA Graphs
A map called `graph_owned_allocations_` is created to track blocks used in a graph (a minimal sketch of this tracking appears after the list below). There are four distinct scenarios involving `cudaMallocAsync` and `cudaFreeAsync` in a CUDA Graph:

1. `Malloc` and `Free` both take place within a graph.
2. `Malloc` takes place within a graph, but `Free` takes place outside the graph.
3. `Malloc` takes place outside a graph, but `Free` takes place within a graph.
4. `Malloc` and `Free` both take place outside any graph.
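A minimal sketch of how such a map could be maintained (class and method names are illustrative assumptions; the real `CUDAMallocAsyncAllocator` logic is more involved):

```cpp
#include <cstddef>
#include <unordered_map>
#include <cuda_runtime.h>

// Illustrative tracker: allocations made while the stream is capturing a
// CUDA graph are recorded so their release can be handled per scenario.
class GraphOwnedTrackerSketch {
 public:
  void* Allocate(std::size_t size, cudaStream_t stream) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, size, stream);
    cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
    cudaStreamIsCapturing(stream, &status);
    if (status == cudaStreamCaptureStatusActive) {
      graph_owned_allocations_[ptr] = size;  // owned by the capturing graph
    }
    return ptr;
  }

  void Free(void* ptr, cudaStream_t stream) {
    if (graph_owned_allocations_.erase(ptr) > 0) {
      // Scenario 1: allocated and freed within the same capture.
      cudaFreeAsync(ptr, stream);
    } else {
      // Scenarios 3 and 4: the block is not graph-owned; free directly.
      // (Scenario 2, freeing a graph-owned block after the capture, would
      // be deferred until the graph completes its capture; omitted here.)
      cudaFreeAsync(ptr, stream);
    }
  }

 private:
  std::unordered_map<void*, std::size_t> graph_owned_allocations_;
};
```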
We handle the release of the graph-owned allocation in the above four cases:
- `FreeImpl` removes the allocation from `graph_owned_allocations_`, followed by `FreeAllocation`.
- `FreeImpl` releases the allocation after the CUDA graph has completed its capture.
- `FreeImpl` calls `FreeAllocation`, and the allocation is freed.
Each element within `graph_owned_allocations_` is allocated in `AllocateImpl`, but deallocation can occur in two ways:

- `FreeImpl`: this implies the allocation is initialized and disposed of during a graph capture (Scenario 1).

Testing the CUDA Async Allocator
Set `FLAGS_use_cuda_malloc_async_allocator=1` and run all tests to validate that the CUDA Asynchronous Allocator can serve as an alternative to the stream-safe allocator. Note that some unrelated tests should be disabled.