
The CUDA Async Allocator #65092


Merged: 6 commits, Jul 15, 2024

Conversation

eee4017
Contributor

@eee4017 eee4017 commented Jun 12, 2024

PR Category

Others

PR Types

New features

Description

This pull request (PR) refactors the CUDA Asynchronous Allocator, introducing a new design for handling stream semantics within the allocator. The CUDA Asynchronous Allocator serves as an alternative to the stream-safe allocator, offloading all stream-ordered memory management to CUDA. The CUDAMallocAsyncAllocator can be enabled by setting the flag FLAGS_use_cuda_malloc_async_allocator=1.

Why We Need the CUDA Async Allocator

(1) Reducing Memory Footprint with Other Python Libraries

In machine learning, we often use various Python libraries to process data. Many of these libraries, such as CuPy and OpenCV, have their own memory pools. When multiple programs or libraries share a GPU, they compete for memory, and memory allocated by one library cannot easily be deallocated by another. The CUDA Async Allocator lets these libraries share a unified pool, which is crucial for smoother integration with the Python GPU-computing ecosystem.

(2) Reducing Memory Footprint with CUDA Graphs

In PaddlePaddle, a separate memory pool is opened for each CUDA Graph, potentially leading to memory waste. When there are many graphs or frequent memory reuse between graphs, out-of-memory (OOM) errors can occur. PR #60516 demonstrates this: inefficient memory usage in the PP or VP training of GPT-3 with CUDA Graph enabled caused OOM errors on four H100-80GB GPUs with the default memory pool. The CUDA Async Allocator reduced memory usage from 95% to 25%.

(3) Decreasing the Memory Management Burden on the Framework

The CUDA stream-ordered memory management API is a better match for stream semantics, allowing allocation and freeing to be ordered on a specific stream. This reduces the workload on the framework. In the stream-safe allocator, cudaEventQuery is called frequently to determine whether a block can be released; with the Async Allocator, this bookkeeping is simplified and offloaded to CUDA. Additionally, cudaMallocAsync and cudaFreeAsync are much faster than cudaMalloc and cudaFree, improving performance in workloads where the stream-safe allocator would otherwise call cudaMalloc frequently.
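As a minimal illustration (not Paddle code), the stream-ordered API that the allocator builds on lets both the allocation and the free be enqueued on a stream, with CUDA guaranteeing ordering:

```cuda
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Both the allocation and the free are ordered on `stream`; the free
  // may be enqueued while earlier stream work still uses the pointer.
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, 1 << 20, stream);    // 1 MiB, ordered on `stream`
  cudaMemsetAsync(ptr, 0, 1 << 20, stream);  // uses the allocation in order
  cudaFreeAsync(ptr, stream);                // freed after prior work completes

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```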

(4) Stricter and Safer Memory Management

By fully offloading memory management to CUDA, the CUDA Async Allocator provides stricter memory management. In the stream-safe allocator, blocks are cached for reuse, so memory may not be freed immediately after a block is released; this caching can hide use-after-free errors that would otherwise surface as CUDA errors. Additionally, the stream-safe allocator may allocate extra memory to satisfy size alignment, masking out-of-bound errors. The CUDA Async Allocator therefore helps detect memory leaks and ensures safer GPU memory usage. Several bugs were detected with the CUDA Async Allocator and fixed in separate PRs.

The Design of the CUDA Async Allocator

This document outlines the design and implementation of the CUDA Asynchronous Allocator. The design addresses how to handle stream semantics in memory management, improving efficiency and reducing memory overhead compared to the traditional streamsafe allocator.

(1) Stream Semantic of the CUDA Memory Management


Figure 1. Stream Safe Allocator


Figure 2. CUDA Async Allocator

When a block is allocated on one stream, it might also be used by other streams. Therefore, a mechanism is required to ensure that the block is freed only after all its uses on those streams have completed.

In the streamsafe allocator (Figure 1):

  • A cudaEvent is created when the block is used on a stream.
  • When a block is released, it is pushed into a queue.
  • cudaEventQuery is used to determine if the block can be released.
  • The check ProcessUnfreedBlock is heuristically triggered when malloc is called, requiring continuous tracking to see if each block can be freed.
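The event-based release check above can be sketched roughly as follows (hypothetical names and structure, not the actual Paddle implementation):

```cuda
#include <cuda_runtime.h>
#include <deque>

struct UnfreedBlock { void* ptr; cudaEvent_t event; };
std::deque<UnfreedBlock> unfreed_queue;  // released blocks, not yet freeable

// Called heuristically from malloc: free every queued block whose recorded
// event has completed, i.e. whose last use on its stream has finished.
void ProcessUnfreedBlocks() {
  while (!unfreed_queue.empty()) {
    UnfreedBlock& b = unfreed_queue.front();
    if (cudaEventQuery(b.event) != cudaSuccess) break;  // still in use
    cudaEventDestroy(b.event);
    cudaFree(b.ptr);
    unfreed_queue.pop_front();
  }
}
```

Note that every queued block must be polled repeatedly until its event completes, which is exactly the continuous tracking the Async Allocator avoids.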

In the Async Allocator (Figure 2):

  • A free stream is used to free the block.
  • Inter-stream synchronization (cudaEventRecord/cudaStreamWaitEvent) ensures the block is freed after it is used.
  • Continuous tracking of blocks is not needed, offloading the burden to CUDA.
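The free-stream mechanism can be sketched as follows (a hypothetical helper under assumed names, not the actual Paddle implementation):

```cuda
#include <cuda_runtime.h>

// Free `ptr` on a dedicated free stream, ordered after its last use on
// `used_stream`, so CUDA performs the free at the right time.
void FreeOnFreeStream(void* ptr, cudaStream_t used_stream,
                      cudaStream_t free_stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, used_stream);         // mark the last use of `ptr`
  cudaStreamWaitEvent(free_stream, ev, 0);  // free stream waits for that use
  cudaFreeAsync(ptr, free_stream);          // CUDA frees it in stream order
  cudaEventDestroy(ev);                     // safe: the wait is already enqueued
}
```

After this call returns, no further polling is needed; CUDA releases the memory once the recorded event completes.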

(2) Throttling Mechanism

When memory is under pressure (nearing OOM), the free operation may not be fast enough. Therefore, the allocation stream needs to be throttled (as indicated by the red arrow in Figure 2). When the memory utilization exceeds the memory_throttle_ratio, a stream synchronization operation is initiated before malloc.

utilization = (allocated_size + pending_release_size) / total_memory_size
if (utilization > memory_throttle_ratio) {
    sync(free_stream, malloc_stream);
}

During synchronization, all memory deallocation requests in the free queue are processed, reducing memory utilization before any new allocation operations proceed. Currently, the ratio is heuristically set to 80%, but it can be adjusted using FLAGS_cuda_malloc_async_pool_memory_throttle_ratio.

  • Lower memory_throttle_ratio Values: Trigger synchronization more frequently, improving memory utilization but possibly decreasing performance due to increased synchronization operations.
  • Higher memory_throttle_ratio Values: Allow more memory allocation before triggering synchronization, enhancing performance by reducing sync operations but increasing the risk of OOM conditions.
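The throttling check above can be sketched as follows (illustrative counters and names; the real allocator tracks these sizes internally):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative bookkeeping; maintained by the allocator in practice.
size_t allocated_size = 0;
size_t pending_release_size = 0;

// Before a malloc: if utilization is above the threshold, make the malloc
// stream wait until the free stream has drained its queued deallocations.
void MaybeThrottle(cudaStream_t malloc_stream, cudaStream_t free_stream,
                   size_t total_memory_size, double memory_throttle_ratio) {
  double utilization =
      static_cast<double>(allocated_size + pending_release_size) /
      static_cast<double>(total_memory_size);
  if (utilization > memory_throttle_ratio) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, free_stream);          // point after queued frees
    cudaStreamWaitEvent(malloc_stream, ev, 0); // sync(free_stream, malloc_stream)
    cudaEventDestroy(ev);
  }
}
```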

(3) Cooperate with CUDA Graph

A map called graph_owned_allocations_ tracks blocks used in a graph. There are four distinct scenarios involving cudaMallocAsync/cudaFreeAsync and CUDA Graph:

  1. Both Malloc and Free take place within a graph.
  2. Malloc takes place within a graph, but Free takes place outside the graph.
  3. Malloc takes place outside a graph, but Free takes place within a graph.
  4. Both Malloc and Free take place outside any graph.

We handle the release of graph-owned allocations in these four cases:

  • Scenario 1: FreeImpl removes the allocation from graph_owned_allocations_, followed by FreeAllocation.
  • Scenario 2: We transfer the ownership of the block to the graph. We use a callback to free the allocation after the graph is destroyed.
  • Scenario 3: FreeImpl releases the allocation after the CUDA graph has completed its capture.
  • Scenario 4: FreeImpl calls FreeAllocation, and the allocation is freed.

Each element of graph_owned_allocations_ is inserted during AllocateImpl, but it can be removed in two ways:

  • Deallocation within FreeImpl: This implies the allocation is initialized and disposed of during a graph capture (Scenario 1).
  • Deallocation in the callback after the graph is destructed: The allocation is initialized during a graph capture but disposed of outside that context (Scenario 2).
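The scenario dispatch can be sketched as follows (a simplified sketch with assumed names and signatures; the real FreeImpl/AllocateImpl differ):

```cuda
#include <cuda_runtime.h>
#include <unordered_map>

// Illustrative tracking map: pointer -> owning graph id.
std::unordered_map<void*, unsigned long long> graph_owned_allocations_;

bool IsCapturing(cudaStream_t stream) {
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaStreamIsCapturing(stream, &status);
  return status == cudaStreamCaptureStatusActive;
}

void AllocateImpl(void** ptr, size_t size, cudaStream_t stream,
                  unsigned long long graph_id) {
  cudaMallocAsync(ptr, size, stream);
  if (IsCapturing(stream)) {
    graph_owned_allocations_[*ptr] = graph_id;  // allocated inside a capture
  }
}

void FreeImpl(void* ptr, cudaStream_t stream) {
  bool capturing = IsCapturing(stream);
  bool graph_owned = graph_owned_allocations_.count(ptr) > 0;
  if (graph_owned && capturing) {
    // Scenario 1: malloc and free both inside the capture.
    graph_owned_allocations_.erase(ptr);
    cudaFreeAsync(ptr, stream);
  } else if (graph_owned) {
    // Scenario 2: ownership transfers to the graph; the actual free runs
    // in a callback after the graph is destroyed.
  } else if (capturing) {
    // Scenario 3: defer the free until the capture completes.
  } else {
    // Scenario 4: ordinary stream-ordered free.
    cudaFreeAsync(ptr, stream);
  }
}
```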

Testing the CUDA Async Allocator

Set FLAGS_use_cuda_malloc_async_allocator=1 and run all tests to validate that the CUDA Asynchronous Allocator can serve as an alternative to the stream-safe allocator. Note that some unrelated tests had to be disabled.


paddle-bot bot commented Jun 12, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Jun 12, 2024
@jeng1220
Collaborator

@eee4017 ,
Windows Build failed

liballocator.lib(cuda_malloc_async_allocator.cc.obj) : error LNK2019: unresolved external symbol "double paddle_flags::FLAGS_cuda_malloc_async_pool_memory_throttle_ratio" (?FLAGS_cuda_malloc_async_pool_memory_throttle_ratio@paddle_flags@@3NA) referenced in function "public: __cdecl paddle::memory::allocation::CUDAMallocAsyncAllocator::CUDAMallocAsyncAllocator(class std::shared_ptr,class phi::GPUPlace const &,struct CUstream_st *)"

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from b34d30f to a98cf12 Compare June 19, 2024 06:39
@jeng1220
Collaborator

@eee4017 , the Windows build failure (same LNK2019 error as above) was resolved.

@tianshuo78520a
Collaborator

PR-CI-Hygon-DCU has some build issues that need to be fixed.

@eee4017
Contributor Author

eee4017 commented Jul 4, 2024

You must have one RD (phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manages the environment variables.

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch 3 times, most recently from a38f7dc to 58e2d5b Compare July 10, 2024 08:11
@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from 58e2d5b to b3d9840 Compare July 11, 2024 02:27
@onecatcn onecatcn requested a review from phlrain July 11, 2024 05:21
/*
* CUDAMallocAsyncAllocator related FLAG
* Name: FLAGS_cuda_malloc_async_pool_memory_throttle_ratio
* Since Version: 2.7
Contributor


2.7 -> 3.0

Contributor Author


fixed

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from c8f18b6 to 21e4dc1 Compare July 12, 2024 03:27
@zyfncg zyfncg merged commit 8b808f1 into PaddlePaddle:develop Jul 15, 2024
30 of 32 checks passed
risemeup1 added a commit that referenced this pull request Jul 15, 2024
lixcli pushed a commit to lixcli/Paddle that referenced this pull request Jul 22, 2024
* Async Pool and Memory Throttling

* fix rocm build

* fix flag

* fix rocm build

* fix flag
Labels
contributor External developers NVIDIA
5 participants