
The CUDA Async Allocator #65092


Merged: 6 commits, Jul 15, 2024

Conversation

eee4017
Contributor

@eee4017 eee4017 commented Jun 12, 2024

PR Category

Others

PR Types

New features

Description

This pull request (PR) refactors the CUDA Asynchronous Allocator, introducing a new design for handling stream semantics within the allocator. The CUDA Asynchronous Allocator serves as an alternative to the stream-safe allocator, offloading all stream-ordered memory management to CUDA. The CUDAMallocAsyncAllocator can be enabled by setting the flag FLAGS_use_cuda_malloc_async_allocator=1.

Why We Need the CUDA Async Allocator

(1) Reducing Memory Footprint with Other Python Libraries

In machine learning, we often use various Python libraries to process data. Many of these libraries, such as CuPy and OpenCV, have their own memory pools. When multiple programs or libraries share a GPU, they compete for memory, and memory allocated by one library cannot easily be deallocated by another. The CUDA Async Allocator lets these libraries share a unified pool, which is crucial for smoother integration with the Python GPU-computing ecosystem.

(2) Reducing Memory Footprint with CUDA Graphs

In PaddlePaddle, a separate memory pool is opened for each CUDA Graph, potentially leading to memory waste. When there are many graphs or frequent memory reuse between graphs, out-of-memory (OOM) errors can occur. PR #60516 demonstrates this: inefficient memory usage in the PP or VP training of GPT-3 with CUDA Graph enabled caused OOM errors on four H100-80GB GPUs with the default memory pool. The CUDA Async Allocator reduced memory usage from 95% to 25%.

(3) Decreasing the Memory Management Burden on the Framework

The CUDA stream-ordered memory management API is a better match for stream semantics, allowing allocation and freeing to be ordered on a specific stream. This reduces the workload on the framework. In the stream-safe allocator, cudaEventQuery is called frequently to determine whether a block can be released; with the Async Allocator, this bookkeeping is simplified and offloaded to CUDA. Additionally, cudaMallocAsync and cudaFreeAsync are much faster than cudaMalloc and cudaFree, improving performance in workloads where the stream-safe allocator would otherwise call cudaMalloc frequently.
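As a minimal illustration (not Paddle code), the stream-ordered API that the allocator builds on lets both the allocation and the free be enqueued on a stream, with CUDA guaranteeing ordering:

```cuda
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Both the allocation and the free are ordered on `stream`; the free
  // may be enqueued while earlier stream work still uses the pointer.
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, 1 << 20, stream);    // 1 MiB, ordered on `stream`
  cudaMemsetAsync(ptr, 0, 1 << 20, stream);  // uses the allocation in order
  cudaFreeAsync(ptr, stream);                // freed after prior work completes

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```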

(4) Stricter and Safer Memory Management

By fully offloading memory management to CUDA, the CUDA Async Allocator provides stricter memory management. In the stream-safe allocator, blocks are cached for reuse, so memory may not be freed immediately after a block is released; this caching can hide use-after-free errors that would otherwise surface as CUDA errors. Additionally, the stream-safe allocator may allocate extra memory to satisfy size alignment, masking out-of-bound errors. The CUDA Async Allocator therefore helps detect memory leaks and ensures safer GPU memory usage. Several bugs were detected with the CUDA Async Allocator and fixed in separate PRs.

The Design of the CUDA Async Allocator

This document outlines the design and implementation of the CUDA Asynchronous Allocator. The design addresses how to handle stream semantics in memory management, improving efficiency and reducing memory overhead compared to the traditional streamsafe allocator.

(1) Stream Semantic of the CUDA Memory Management


Figure 1. Stream Safe Allocator


Figure 2. CUDA Async Allocator

When a block is allocated on one stream, it might also be used by other streams. Therefore, a mechanism is required to ensure that the block is freed only after all its uses on those streams have completed.

In the streamsafe allocator (Figure 1):

  • A cudaEvent is created when the block is used on a stream.
  • When a block is released, it is pushed into a queue.
  • cudaEventQuery is used to determine if the block can be released.
  • The check ProcessUnfreedBlock is heuristically triggered when malloc is called, requiring continuous tracking to see if each block can be freed.
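The event-based release check above can be sketched roughly as follows (hypothetical names and structure, not the actual Paddle implementation):

```cuda
#include <cuda_runtime.h>
#include <deque>

struct UnfreedBlock { void* ptr; cudaEvent_t event; };
std::deque<UnfreedBlock> unfreed_queue;  // released blocks, not yet freeable

// Called heuristically from malloc: free every queued block whose recorded
// event has completed, i.e. whose last use on its stream has finished.
void ProcessUnfreedBlocks() {
  while (!unfreed_queue.empty()) {
    UnfreedBlock& b = unfreed_queue.front();
    if (cudaEventQuery(b.event) != cudaSuccess) break;  // still in use
    cudaEventDestroy(b.event);
    cudaFree(b.ptr);
    unfreed_queue.pop_front();
  }
}
```

Note that every queued block must be polled repeatedly until its event completes, which is exactly the continuous tracking the Async Allocator avoids.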

In the Async Allocator (Figure 2):

  • A free stream is used to free the block.
  • Inter-stream synchronization (cudaEventRecord/cudaStreamWaitEvent) ensures the block is freed after it is used.
  • Continuous tracking of blocks is not needed, offloading the burden to CUDA.
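The free-stream mechanism can be sketched as follows (a hypothetical helper under assumed names, not the actual Paddle implementation):

```cuda
#include <cuda_runtime.h>

// Free `ptr` on a dedicated free stream, ordered after its last use on
// `used_stream`, so CUDA performs the free at the right time.
void FreeOnFreeStream(void* ptr, cudaStream_t used_stream,
                      cudaStream_t free_stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, used_stream);         // mark the last use of `ptr`
  cudaStreamWaitEvent(free_stream, ev, 0);  // free stream waits for that use
  cudaFreeAsync(ptr, free_stream);          // CUDA frees it in stream order
  cudaEventDestroy(ev);                     // safe: the wait is already enqueued
}
```

After this call returns, no further polling is needed; CUDA releases the memory once the recorded event completes.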

(2) Throttling Mechanism

When memory is under pressure (nearing OOM), the free operation may not be fast enough. Therefore, the allocation stream needs to be throttled (as indicated by the red arrow in Figure 2). When the memory utilization exceeds the memory_throttle_ratio, a stream synchronization operation is initiated before malloc.

utilization = (allocated_size + pending_release_size) / total_memory_size
if (utilization > memory_throttle_ratio) {
    sync(free_stream, malloc_stream);
}

During synchronization, all memory deallocation requests in the free queue are processed, reducing memory utilization before any new allocation operations proceed. Currently, the ratio is heuristically set to 80%, but it can be adjusted using FLAGS_cuda_malloc_async_pool_memory_throttle_ratio.

  • Lower memory_throttle_ratio Values: Trigger synchronization more frequently, improving memory utilization but possibly decreasing performance due to increased synchronization operations.
  • Higher memory_throttle_ratio Values: Allow more memory allocation before triggering synchronization, enhancing performance by reducing sync operations but increasing the risk of OOM conditions.
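The throttling check above can be sketched as follows (illustrative counters and names; the real allocator tracks these sizes internally):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative bookkeeping; maintained by the allocator in practice.
size_t allocated_size = 0;
size_t pending_release_size = 0;

// Before a malloc: if utilization is above the threshold, make the malloc
// stream wait until the free stream has drained its queued deallocations.
void MaybeThrottle(cudaStream_t malloc_stream, cudaStream_t free_stream,
                   size_t total_memory_size, double memory_throttle_ratio) {
  double utilization =
      static_cast<double>(allocated_size + pending_release_size) /
      static_cast<double>(total_memory_size);
  if (utilization > memory_throttle_ratio) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, free_stream);          // point after queued frees
    cudaStreamWaitEvent(malloc_stream, ev, 0); // sync(free_stream, malloc_stream)
    cudaEventDestroy(ev);
  }
}
```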

(3) Cooperate with CUDA Graph

A map called graph_owned_allocations_ tracks blocks used in a graph. There are four distinct scenarios involving cudaMallocAsync/cudaFreeAsync and CUDA Graph:

  1. Both Malloc and Free take place within a graph.
  2. Malloc takes place within a graph, but Free takes place outside the graph.
  3. Malloc takes place outside a graph, but Free takes place within a graph.
  4. Both Malloc and Free take place outside any graph.

We handle the release of graph-owned allocations in these four cases:

  • Scenario 1: FreeImpl removes the allocation from graph_owned_allocations_, followed by FreeAllocation.
  • Scenario 2: We transfer the ownership of the block to the graph. We use a callback to free the allocation after the graph is destroyed.
  • Scenario 3: FreeImpl releases the allocation after the CUDA graph has completed its capture.
  • Scenario 4: FreeImpl calls FreeAllocation, and the allocation is freed.

Each element of graph_owned_allocations_ is inserted during AllocateImpl, but it can be removed in two ways:

  • Deallocation within FreeImpl: This implies the allocation is initialized and disposed of during a graph capture (Scenario 1).
  • Deallocation in the callback after the graph is destructed: The allocation is initialized during a graph capture but disposed of outside that context (Scenario 2).
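The scenario dispatch can be sketched as follows (a simplified sketch with assumed names and signatures; the real FreeImpl/AllocateImpl differ):

```cuda
#include <cuda_runtime.h>
#include <unordered_map>

// Illustrative tracking map: pointer -> owning graph id.
std::unordered_map<void*, unsigned long long> graph_owned_allocations_;

bool IsCapturing(cudaStream_t stream) {
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaStreamIsCapturing(stream, &status);
  return status == cudaStreamCaptureStatusActive;
}

void AllocateImpl(void** ptr, size_t size, cudaStream_t stream,
                  unsigned long long graph_id) {
  cudaMallocAsync(ptr, size, stream);
  if (IsCapturing(stream)) {
    graph_owned_allocations_[*ptr] = graph_id;  // allocated inside a capture
  }
}

void FreeImpl(void* ptr, cudaStream_t stream) {
  bool capturing = IsCapturing(stream);
  bool graph_owned = graph_owned_allocations_.count(ptr) > 0;
  if (graph_owned && capturing) {
    // Scenario 1: malloc and free both inside the capture.
    graph_owned_allocations_.erase(ptr);
    cudaFreeAsync(ptr, stream);
  } else if (graph_owned) {
    // Scenario 2: ownership transfers to the graph; the actual free runs
    // in a callback after the graph is destroyed.
  } else if (capturing) {
    // Scenario 3: defer the free until the capture completes.
  } else {
    // Scenario 4: ordinary stream-ordered free.
    cudaFreeAsync(ptr, stream);
  }
}
```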

Testing the CUDA Async Allocator

Set FLAGS_use_cuda_malloc_async_allocator=1 and run all tests to validate that the CUDA Asynchronous Allocator can serve as an alternative to the stream-safe allocator. Note that some unrelated tests had to be disabled.


paddle-bot bot commented Jun 12, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Jun 12, 2024
@jeng1220
Collaborator

@eee4017 ,
Windows Build failed

liballocator.lib(cuda_malloc_async_allocator.cc.obj) : error LNK2019: unresolved external symbol "double paddle_flags::FLAGS_cuda_malloc_async_pool_memory_throttle_ratio" (?FLAGS_cuda_malloc_async_pool_memory_throttle_ratio@paddle_flags@@3NA) referenced in function "public: __cdecl paddle::memory::allocation::CUDAMallocAsyncAllocator::CUDAMallocAsyncAllocator(class std::shared_ptr,class phi::GPUPlace const &,struct CUstream_st *)"

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from b34d30f to a98cf12 Compare June 19, 2024 06:39
@jeng1220
Collaborator

@eee4017 , the Windows build failure (same LNK2019 error as above) was resolved.

@tianshuo78520a
Collaborator

PR-CI-Hygon-DCU has some build issues that need to be fixed.

@eee4017
Contributor Author

eee4017 commented Jul 4, 2024

You must have one RD (phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manages the environment variables.

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch 3 times, most recently from a38f7dc to 58e2d5b Compare July 10, 2024 08:11
@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from 58e2d5b to b3d9840 Compare July 11, 2024 02:27
@onecatcn onecatcn requested a review from phlrain July 11, 2024 05:21
/*
* CUDAMallocAsyncAllocator related FLAG
* Name: FLAGS_cuda_malloc_async_pool_memory_throttle_ratio
* Since Version: 2.7
Contributor


2.7 -> 3.0

Contributor Author


fixed

@eee4017 eee4017 force-pushed the cuda_malloc_async_pool branch from c8f18b6 to 21e4dc1 Compare July 12, 2024 03:27
@zyfncg zyfncg merged commit 8b808f1 into PaddlePaddle:develop Jul 15, 2024
30 of 32 checks passed
risemeup1 added a commit that referenced this pull request Jul 15, 2024
lixcli pushed a commit to lixcli/Paddle that referenced this pull request Jul 22, 2024
* Async Pool and Memory Throttling

* fix rocm build

* fix flag

* fix rocm build

* fix flag
Labels
contributor External developers NVIDIA
5 participants