[Transform] Only use gc runtime allocator for stack-like alloca ops #287

Merged: 17 commits into main, Sep 4, 2024

Conversation

@crazydemo crazydemo commented Aug 27, 2024

Use createPromoteBuffersToStackPass to convert all legal stack-like allocations into memref.alloca, and apply the gc runtime allocator transformation only to alloca ops; the corresponding dealloc is inserted.

Fixes Issue#279, Issue#278.
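As a rough sketch of the flow described above (plain Python pseudocode; the op names and the stack-like flag are simplifications, not the actual MLIR pass APIs):

```python
# Hypothetical model of the two-step transform (not real MLIR APIs):
# step 1 promotes legal stack-like allocs to alloca, step 2 rewrites
# alloca ops to gc runtime allocations and inserts matching deallocs
# in reverse (FILO) order at the end of the scope.

def transform(ops):
    """ops: list of (kind, name, is_stack_like) tuples in program order."""
    promoted = []
    # Step 1: createPromoteBuffersToStackPass-like promotion.
    for kind, name, stack_like in ops:
        if kind == "memref.alloc" and stack_like:
            kind = "memref.alloca"
        promoted.append((kind, name))
    # Step 2: rewrite alloca ops only; heap allocs are left untouched.
    rewritten, allocas = [], []
    for kind, name in promoted:
        if kind == "memref.alloca":
            rewritten.append(("cpuruntime.alloc", name))
            allocas.append(name)
        else:
            rewritten.append((kind, name))
    # Insert the corresponding deallocs in reverse allocation order.
    rewritten += [("cpuruntime.dealloc", n) for n in reversed(allocas)]
    return rewritten
```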

@crazydemo crazydemo force-pushed the zhangyan/enhance_allocator branch from 7de1cda to 71e675b on August 27, 2024 07:26
@crazydemo crazydemo changed the title Fallback memref transformation when alloc / dealloc not in FILO fashion [Transform] Fallback memref transformation when alloc / dealloc not in FILO fashion Aug 27, 2024
@ciyongch ciyongch left a comment

It would be good to add a simple test case to cover this transform fallback instead of throwing an error.

} else if (isa<memref::DeallocOp>(op)) {
// fallback alloc / dealloc not in FILO fashion
Value deallocMemref = op->getOperands().front();
auto aliases = analysis.resolveReverse(deallocMemref);
Contributor:

Do we really need the aliases for analyzing deallocMemref vs allocMemref? They'll be exactly the same, right?

Author:

According to createBufferDeallocationPass, deallocMemref will be the same as allocMemref. But just in case we get an IR like the one below and directly call the memref-to-cpuruntime pass: do you think we can omit this possibility and not use alias analysis?

func.func @alloc_dealloc_view() {
  // CHECK-LABEL: func @alloc_dealloc_view()
  // CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xi8>
  %c0 = arith.constant 0: index
  %f0 = arith.constant 0.0: f32
  %alloc = memref.alloc() : memref<128xi8>
  %view = memref.view %alloc[%c0][] : memref<128xi8> to memref<32xf32>
  %subview = memref.subview %view[0][16][1] : memref<32xf32> to memref<16xf32>
  // CHECK: cpuruntime.dealloc
  memref.dealloc %subview : memref<16xf32>
  return
}
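The resolveReverse call in the earlier snippet is meant to walk view-like aliases back to the owning allocation. A minimal sketch of that idea in Python (the def map and function names are illustrative, not the real analysis API):

```python
# Hypothetical model of walking a view chain back to its root allocation.
VIEW_LIKE_OPS = {"memref.view", "memref.subview"}

def root_allocation(value, defs):
    """defs maps each SSA value to (defining op, source operand).
    Follow view-like ops until the allocating op is reached."""
    while True:
        op, source = defs[value]
        if op in VIEW_LIKE_OPS:
            value = source  # step to the viewed memref
        else:
            return value    # result of the root alloc

# The IR from the example above, flattened into a def map:
defs = {
    "%alloc":   ("memref.alloc", None),
    "%view":    ("memref.view", "%alloc"),
    "%subview": ("memref.subview", "%view"),
}
```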

Reviewer:

Why do we have such a case when it is not stack-like?

Author:

This case is from Issue#279. After the bufferDeallocation pass, we get the following IR, whose alloc / dealloc are in FIFO fashion. This case should be resolved by PR#283 with the MergeAlloc pass enabled, so the small temp buffers will be merged into a larger buffer.
It would be better to check that the alloc / dealloc are stack-like, in case we still have some non-stack-like allocations in complex scenarios.

scf.parallel (%arg11) = (%c0_3) to (%c32_4) step (%c16_5) {
      %alloc_18 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xbf16>
      %subview = memref.subview %arg2[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      %subview_19 = memref.subview %expand_shape[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %subview[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %4 = memref.load %subview_19[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %5 = arith.extf %4 fastmath<contract> : bf16 to f32
        %6 = arith.extf %3 fastmath<contract> : bf16 to f32
        %7 = arith.addf %6, %5 : f32
        %8 = arith.truncf %7 fastmath<contract> : f32 to bf16
        memref.store %8, %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        scf.reduce 
      }
      %alloc_20 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        %4 = arith.extf %3 : bf16 to f32
        memref.store %4, %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_18 : memref<1x16x4096xbf16>
      %alloc_21 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %4 = memref.load %0[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %5 = math.powf %3, %4 : f32
        memref.store %5, %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_20 : memref<1x16x4096xf32>
      %alloc_22 = memref.alloc() {alignment = 64 : i64} : memref<1x16xf32>
      scf.parallel (%arg12, %arg13) = (%c0, %c0) to (%c1, %c16) step (%c1, %c1) {
        memref.store %cst, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        scf.for %arg14 = %c0 to %c4096 step %c1 {
          %3 = memref.load %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
          %4 = memref.load %alloc_22[%arg12, %arg13] : memref<1x16xf32>
          %5 = arith.addf %3, %4 : f32
          memref.store %5, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        }
        scf.reduce 
      }

Contributor:

But just in case we will have an IR as below, and directly call the memref-to-cpuruntime pass. Do you think we can omit this possibility, and do not use alias analysis?

I don't think it's valid to dealloc/free a partial buffer through either a view or a subview of memref.alloc; please check the statement here: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td#L552-L555

Author:

Thanks for pointing out this reference. I will remove the alias analysis.

Contributor:

Why do we have such a case when it is not stack-like?

Ideally, all the separate memref.alloc ops within the same scope should be merged into one big chunk by the memref-merge pass. This fallback can be a workaround before the memref-merge patch is pulled in, or in case there are non-FILO alloc/dealloc scenarios.

Reviewer:

The alloc/free used to be FILO (stack-like). Is there any recent optimization upstream that changed this?

Ideally, all the separated memref.alloc within the same scope should be merged into a big chunk by memref-merge pass,

The buffer-merge pass is an optimization and can be turned off at -O0 in the future. I suppose we should always make this pass safe to run.

Reviewer:

If we are handling dynamically shaped buffers, buffer-merge won't work and we will still need many small allocations. I would suggest that in our pass pipeline, if the stack-like allocator is on, we should make sure the BufferDeallocationPipeline maintains a FILO pattern; if the stack-like allocator is off, we can allow non-FILO patterns in the BufferDeallocationPipeline.

@crazydemo (Author):

It would be good to add a simple test case to cover this transform fallback instead of throwing an error.

added test case.

@@ -96,6 +97,24 @@ struct ConvertMemRefToCPURuntime
}
}
}
} else if (isa<memref::AllocOp>(op)) {

Reviewer:

Please note that the walk() order may differ from the execution order. And what about the case

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     dealloc(b)
     dealloc(a)
     return c
}

The optimization still applies.

Reviewer:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

It looks stack-like, but it is not.

I would suggest a stricter checking:

  1. first, check if an alloc is "leaked" from the scope (already done in your code in main branch)
  2. for the non-leaked allocations only: check whether the order of alloc/free is stack-like within the same Region, ignoring the leaked allocations. Usually, we can ensure the execution order within the same region.
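The stricter check in step 2 amounts to replaying a region's alloc/dealloc events against a stack; a minimal sketch, assuming the leaked allocations from step 1 are already known (names are illustrative, not the pass code):

```python
def is_stack_like(events, leaked=frozenset()):
    """events: ("alloc", name) / ("dealloc", name) pairs in the
    execution order of one region. Returns True iff every dealloc
    frees the most recently allocated live non-leaked buffer,
    i.e. the FILO (stack) discipline holds."""
    stack = []
    for kind, name in events:
        if name in leaked:
            continue  # leaked allocations are ignored entirely
        if kind == "alloc":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # freed out of FILO order
    return True
```

On the func AAA example above, ignoring the leaked c, the sequence alloc a, alloc b, dealloc b, dealloc a passes the check, so the optimization still applies.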

Contributor:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

Not sure if it's a valid case or not, as it will always result in a memory leak?

Reviewer:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

Not sure if it's a valid case or not, as it will always result in a memory leak?

I think we are targeting a general MLIR compiler, so any user input should be handled. Users can always write code like this. At least it will not produce a wrong result, if they can live with the memory leak.

@Yun-Fly Yun-Fly mentioned this pull request Aug 28, 2024
@Yun-Fly Yun-Fly linked an issue Aug 28, 2024 that may be closed by this pull request

crazydemo commented Aug 28, 2024

PR#295 aligns the bufferization pipeline with upstream and can fix Issue#279, Issue#278, and Issue#284.

The current PR needs to be modified; it seems allocaOp can be transformed to the gc runtime allocator.

@crazydemo crazydemo added the WIP work in progress label Aug 29, 2024
@crazydemo crazydemo changed the title [Transform] Fallback memref transformation when alloc / dealloc not in FILO fashion [Transform] Only use gc runtime allocator for stack-like alloca ops Aug 29, 2024
@crazydemo crazydemo added ready to review and removed WIP work in progress labels Aug 29, 2024
@ciyongch ciyongch left a comment

This change tries to resolve the FILO allocation issue, but it also results in converting all the stack allocations (even small ones) into the runtime allocator; maybe not that efficient?


crazydemo commented Aug 29, 2024

This change tries to resolve the FILO allocation issue, but it also results in converting all the stack allocations (even small ones) into the runtime allocator; maybe not that efficient?

There are several ways to solve the alloc / dealloc ordering issue:

  1. the current PR's solution: reuse the PromoteBuffersToStackPass.
  2. modify the upstream bufferization pipeline; this may need a lot of discussion with the community.
  3. create a GC pass that does the deallocation on the GC side; this may contain a large amount of logic that is redundant with upstream.
  4. create a GC pass that reorganizes the generated memref.alloc / memref.dealloc ops. The IR generated by the bufferization pipeline tends to have dealloc operations concentrated at the end of a region, with the deallocs in FIFO order. Writing this pass should be fairly straightforward, but it requires extensive testing.

We can go with item 1, but it needs some enhancement to distinguish the buffer size.

if (firstDeallocOp) {
  builder.setInsertionPoint(firstDeallocOp);
} else {
  // If no deallocOp was found, insert at the end of the region before

Reviewer:

Please insert the deallocOp at the end of the parent alloca_scope op. It may not be the same as op->getParentRegion().

And I am not sure why we need to find firstDeallocOp.

Another issue is that, if the alloca size is very small, I think we'd better keep using the alloca instead of a runtime function. In GC v1, I remember we had a size limit of ~128 bytes.

Contributor:

Another issue is that, if the alloca size is very small, I think we'd better keep using the alloca instead of a runtime function. In GC v1 I remember that we have a size limit at ~128 bytes.

This is what I am concerned about: the current pass is trying to convert all the allocas (no matter the size) into the runtime allocator, which is not that efficient.

Author:

Thanks for the suggestion.

  1. Used getParentWithTrait<OpTrait::AutomaticAllocationScope>() to locate the dealloc op's position.
  2. Used a while loop to find the firstDeallocOp, to ensure the deallocation order obeys FILO:

// Move the insertion point back if the operation before is a deallocOp
Operation *currentOp = builder.getInsertionPoint()->getPrevNode();
while (currentOp && isa<cpuruntime::DeallocOp>(currentOp)) {
  // Move the insertion point before the found deallocOp
  builder.setInsertionPoint(currentOp);
  currentOp = currentOp->getPrevNode();
}

  3. The size limit is added.

Reviewer:

need to use a while loop to find the firstDeallocOp to ensure the deallocation order obeys FILO.

Please double-check the IR result of promote-allocation-to-stack. AFAIK, there is no dealloc for alloca ops; they are "deallocated" at the end of the AutomaticAllocationScope.

Author:

Yes, from promote-allocation-to-stack through the end of the deallocation pipeline, no dealloc is created for alloca ops; they are deallocated automatically at the end of the AutomaticAllocationScope.

The above while loop only searches for cpuruntime::DeallocOp, to make sure the generated dealloc order is FILO.

Reviewer:

Why do we need to check for cpuruntime::DeallocOp ops that have just been created by us in this pass?

Author:

When we walk through the AutomaticAllocationScope with the IR below, the dealloc for the first allocated buffer m0 is created first, at the end of the scope. The dealloc for the next allocated buffer m1 must then be inserted as the previous node of dealloc %m0.

func.func @alloca_sequence() {
  // CHECK-LABEL: func @alloca_sequence()
  // CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xf32>
  // CHECK: %[[m1:.*]] = cpuruntime.alloc() : memref<128xf32>
  // CHECK: %[[m2:.*]] = cpuruntime.alloc() : memref<128xf32>
  %alloc = memref.alloca() : memref<128xf32>
  %alloc_0 = memref.alloca() : memref<128xf32>
  %alloc_1 = memref.alloca() : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m2]] : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m1]] : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m0]] : memref<128xf32>
  return
}
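The walk-back described above can be modeled as inserting each new dealloc before any runtime deallocs already emitted at the scope end, which yields the reversed order in the CHECK lines (a Python sketch, not the pass itself):

```python
def insert_deallocs(allocas_in_order):
    """Model of the insertion-point walk-back: the dealloc for each
    buffer is placed before the deallocs already emitted, so the
    final order is the reverse of the allocation order (FILO)."""
    scope_tail = []  # ops at the end of the AutomaticAllocationScope
    for name in allocas_in_order:
        # Walking the insertion point back past the existing
        # cpuruntime.dealloc ops is equivalent to prepending here.
        scope_tail.insert(0, ("cpuruntime.dealloc", name))
    return scope_tail
```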

Author:

The pass now traverses the alloca ops first and then inserts the deallocOps in reverse order.

@crazydemo crazydemo requested a review from niuxiaog August 30, 2024 06:15
@crazydemo crazydemo force-pushed the zhangyan/enhance_allocator branch from 64e10ef to 6668da9 on August 30, 2024 06:23
@crazydemo (Author):

@ciyongch @Menooker Could you please help review this PR?

funcOp.walk([&](memref::AllocaOp allocaOp) {
  uint64_t allocSize =
      getMemRefSizeInBytes(allocaOp.getResult().getType());
  if (allocSize < 128) {
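The excerpted size check can be sketched as follows; the byte-size computation and the STACK_ALLOC_THRESHOLD name are assumptions based on this discussion, not the actual pass code:

```python
STACK_ALLOC_THRESHOLD = 128  # bytes; assumed cutoff from the discussion

ELEMENT_BYTES = {"f32": 4, "bf16": 2, "i8": 1}  # illustrative subset

def memref_size_in_bytes(shape, elem_type):
    """Static size of a memref, e.g. memref<128xf32> -> 512 bytes."""
    size = ELEMENT_BYTES[elem_type]
    for dim in shape:
        size *= dim
    return size

def use_runtime_allocator(shape, elem_type):
    """Small allocas stay as plain stack allocations; only buffers at
    or above the threshold go to the gc runtime allocator."""
    return memref_size_in_bytes(shape, elem_type) >= STACK_ALLOC_THRESHOLD
```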
Contributor:

Please use a meaningful named constant (something like STACK_ALLOC_THRESHOLD) for the number here.

Author:

Thanks for the advice. Fixed.

@crazydemo (Author):

Tests are updated.

@ciyongch ciyongch merged commit d72b1f1 into main Sep 4, 2024
6 checks passed
Successfully merging this pull request may close these issues.

  1. Correctness check fail
  2. Corrupt stack detected at runtime
  3. bf16 MLP failed with memory pool enabled