[Transform] Only use gc runtime allocator for stack-like alloca ops #287

Merged: 17 commits into main, Sep 4, 2024

Conversation

@crazydemo crazydemo commented Aug 27, 2024

Use createPromoteBuffersToStackPass to convert all legal stack-like allocations into memref.alloca, and apply the gc runtime allocator transformation only to alloca ops; the corresponding dealloc is inserted.

Fixes Issue#279, Issue#278.
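As a rough sketch of the flow described above (plain Python pseudocode; the op names and the stack-like flag are simplifications, not the actual MLIR pass APIs):

```python
# Hypothetical model of the two-step transform (not real MLIR APIs):
# step 1 promotes legal stack-like allocs to alloca, step 2 rewrites
# alloca ops to gc runtime allocations and inserts matching deallocs
# in reverse (FILO) order at the end of the scope.

def transform(ops):
    """ops: list of (kind, name, is_stack_like) tuples in program order."""
    promoted = []
    # Step 1: createPromoteBuffersToStackPass-like promotion.
    for kind, name, stack_like in ops:
        if kind == "memref.alloc" and stack_like:
            kind = "memref.alloca"
        promoted.append((kind, name))
    # Step 2: rewrite alloca ops only; heap allocs are left untouched.
    rewritten, allocas = [], []
    for kind, name in promoted:
        if kind == "memref.alloca":
            rewritten.append(("cpuruntime.alloc", name))
            allocas.append(name)
        else:
            rewritten.append((kind, name))
    # Insert the corresponding deallocs in reverse allocation order.
    rewritten += [("cpuruntime.dealloc", n) for n in reversed(allocas)]
    return rewritten
```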

@crazydemo crazydemo force-pushed the zhangyan/enhance_allocator branch from 7de1cda to 71e675b on August 27, 2024 07:26
@crazydemo crazydemo changed the title Fallback memref transformation when alloc / dealloc not in FILO fashion [Transform] Fallback memref transformation when alloc / dealloc not in FILO fashion Aug 27, 2024
@ciyongch ciyongch left a comment

It would be good to add a simple test case to cover this transform fallback instead of throwing an error.

} else if (isa<memref::DeallocOp>(op)) {
// fallback alloc / dealloc not in FILO fashion
Value deallocMemref = op->getOperands().front();
auto aliases = analysis.resolveReverse(deallocMemref);
Contributor:

Do we really need the aliases for analyzing deallocMemref vs allocMemref? They'll be exactly the same, right?

Author:

According to createBufferDeallocationPass, deallocMemref will be the same as allocMemref. But just in case we get an IR like the one below and directly call the memref-to-cpuruntime pass: do you think we can omit this possibility and not use alias analysis?

func.func @alloc_dealloc_view() {
  // CHECK-LABEL: func @alloc_dealloc_view()
  // CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xi8>
  %c0 = arith.constant 0: index
  %f0 = arith.constant 0.0: f32
  %alloc = memref.alloc() : memref<128xi8>
  %view = memref.view %alloc[%c0][] : memref<128xi8> to memref<32xf32>
  %subview = memref.subview %view[0][16][1] : memref<32xf32> to memref<16xf32>
  // CHECK: cpuruntime.dealloc
  memref.dealloc %subview : memref<16xf32>
  return
}
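The resolveReverse call in the earlier snippet is meant to walk view-like aliases back to the owning allocation. A minimal sketch of that idea in Python (the def map and function names are illustrative, not the real analysis API):

```python
# Hypothetical model of walking a view chain back to its root allocation.
VIEW_LIKE_OPS = {"memref.view", "memref.subview"}

def root_allocation(value, defs):
    """defs maps each SSA value to (defining op, source operand).
    Follow view-like ops until the allocating op is reached."""
    while True:
        op, source = defs[value]
        if op in VIEW_LIKE_OPS:
            value = source  # step to the viewed memref
        else:
            return value    # result of the root alloc

# The IR from the example above, flattened into a def map:
defs = {
    "%alloc":   ("memref.alloc", None),
    "%view":    ("memref.view", "%alloc"),
    "%subview": ("memref.subview", "%view"),
}
```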

Reviewer:

Why do we have such a case when it is not stack-like?

Author:

This case is from Issue#279. After the bufferDeallocation pass, we get the following IR, whose alloc / dealloc are in FIFO fashion. This case should be resolved by PR#283 with the MergeAlloc pass enabled, so the small temp buffers will be merged into a larger buffer.
It would be better to check that the alloc / dealloc are stack-like, in case we still have some non-stack-like allocations in complex scenarios.

scf.parallel (%arg11) = (%c0_3) to (%c32_4) step (%c16_5) {
      %alloc_18 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xbf16>
      %subview = memref.subview %arg2[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      %subview_19 = memref.subview %expand_shape[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %subview[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %4 = memref.load %subview_19[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %5 = arith.extf %4 fastmath<contract> : bf16 to f32
        %6 = arith.extf %3 fastmath<contract> : bf16 to f32
        %7 = arith.addf %6, %5 : f32
        %8 = arith.truncf %7 fastmath<contract> : f32 to bf16
        memref.store %8, %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        scf.reduce 
      }
      %alloc_20 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        %4 = arith.extf %3 : bf16 to f32
        memref.store %4, %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_18 : memref<1x16x4096xbf16>
      %alloc_21 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %4 = memref.load %0[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %5 = math.powf %3, %4 : f32
        memref.store %5, %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_20 : memref<1x16x4096xf32>
      %alloc_22 = memref.alloc() {alignment = 64 : i64} : memref<1x16xf32>
      scf.parallel (%arg12, %arg13) = (%c0, %c0) to (%c1, %c16) step (%c1, %c1) {
        memref.store %cst, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        scf.for %arg14 = %c0 to %c4096 step %c1 {
          %3 = memref.load %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
          %4 = memref.load %alloc_22[%arg12, %arg13] : memref<1x16xf32>
          %5 = arith.addf %3, %4 : f32
          memref.store %5, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        }
        scf.reduce 
      }

Contributor:

But just in case we will have an IR as below, and directly call the memref-to-cpuruntime pass. Do you think we can omit this possibility, and do not use alias analysis?

I don't think it's valid to dealloc/free a partial buffer through either a view or a subview of memref.alloc; please check the statement here: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td#L552-L555

Author:

Thanks for pointing out this reference. I will remove the alias analysis.

Contributor:

Why do we have such a case when it is not stack-like?

Ideally, all the separate memref.alloc ops within the same scope should be merged into one big chunk by the memref-merge pass. This fallback can be a workaround before the memref-merge patch is pulled in, or in case there are non-FILO alloc/dealloc scenarios.

Reviewer:

The alloc/free used to be FILO (stack-like). Is there any recent optimization upstream that changed this?

Ideally, all the separated memref.alloc within the same scope should be merged into a big chunk by memref-merge pass,

The buffer-merge pass is an optimization and can be turned off at -O0 in the future. I suppose we should always make this pass safe to run.

Reviewer:

If we are handling dynamically shaped buffers, buffer-merge won't work and we will still need many small allocations. I would suggest that in our pass pipeline, if the stack-like allocator is on, we should make sure the BufferDeallocationPipeline maintains a FILO pattern; if the stack-like allocator is off, we can allow non-FILO patterns in the BufferDeallocationPipeline.

@crazydemo (Author):

It would be good to add a simple test case to cover this transform fallback instead of throwing an error.

added test case.

@@ -96,6 +97,24 @@ struct ConvertMemRefToCPURuntime
}
}
}
} else if (isa<memref::AllocOp>(op)) {

Reviewer:

Please note that the walk() order may differ from the execution order. And what about the case

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     dealloc(b)
     dealloc(a)
     return c
}

The optimization still applies.

Reviewer:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

It looks stack-like, but it is not.

I would suggest a stricter checking:

  1. first, check if an alloc is "leaked" from the scope (already done in your code in main branch)
  2. for the non-leaked allocations only: check whether the order of alloc/free is stack-like within the same Region, ignoring the leaked allocations. Usually, we can ensure the execution order within the same region.
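The stricter check in step 2 amounts to replaying a region's alloc/dealloc events against a stack; a minimal sketch, assuming the leaked allocations from step 1 are already known (names are illustrative, not the pass code):

```python
def is_stack_like(events, leaked=frozenset()):
    """events: ("alloc", name) / ("dealloc", name) pairs in the
    execution order of one region. Returns True iff every dealloc
    frees the most recently allocated live non-leaked buffer,
    i.e. the FILO (stack) discipline holds."""
    stack = []
    for kind, name in events:
        if name in leaked:
            continue  # leaked allocations are ignored entirely
        if kind == "alloc":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # freed out of FILO order
    return True
```

On the func AAA example above, ignoring the leaked c, the sequence alloc a, alloc b, dealloc b, dealloc a passes the check, so the optimization still applies.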

Contributor:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

Not sure if it's a valid case or not, as it will always result in a memory leak?

Reviewer:

Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

Not sure if it's a valid case or not, as it will always result in a memory leak?

I think we are targeting a general MLIR compiler, so any user input should be handled. Users can always write code like this. At least it will not produce a wrong result, if they can live with the memory leak.

@Yun-Fly Yun-Fly mentioned this pull request Aug 28, 2024
@Yun-Fly Yun-Fly linked an issue Aug 28, 2024 that may be closed by this pull request

crazydemo commented Aug 28, 2024

PR#295 aligns the bufferization pipeline with upstream and can fix Issue#279, Issue#278, and Issue#284.

The current PR needs to be modified; it seems allocaOp can be transformed to the gc runtime allocator.

@crazydemo crazydemo added the WIP work in progress label Aug 29, 2024
@crazydemo crazydemo changed the title [Transform] Fallback memref transformation when alloc / dealloc not in FILO fashion [Transform] Only use gc runtime allocator for stack-like alloca ops Aug 29, 2024
@crazydemo crazydemo added ready to review and removed WIP work in progress labels Aug 29, 2024
@ciyongch ciyongch left a comment

This change tries to resolve the FILO allocation issue, but it also results in converting all the stack allocations (even small ones) into the runtime allocator; maybe not that efficient?


crazydemo commented Aug 29, 2024

This change tries to resolve the FILO allocation issue, but it also results in converting all the stack allocations (even small ones) into the runtime allocator; maybe not that efficient?

There are several ways to solve the alloc / dealloc ordering issue:

  1. the current PR's solution: reuse the PromoteBuffersToStackPass.
  2. modify the upstream bufferization pipeline; this may need a lot of discussion with the community.
  3. create a GC pass that does the deallocation on the GC side; this may contain a large amount of logic that is redundant with upstream.
  4. create a GC pass that reorganizes the generated memref.alloc / memref.dealloc ops. The IR generated by the bufferization pipeline tends to have dealloc operations concentrated at the end of a region, with the deallocs in FIFO order. Writing this pass should be fairly straightforward, but it requires extensive testing.

We can go with item 1, but it needs some enhancement to distinguish the buffer size.

if (firstDeallocOp) {
  builder.setInsertionPoint(firstDeallocOp);
} else {
  // If no deallocOp was found, insert at the end of the region before

Reviewer:

Please insert the deallocOp at the end of the parent alloca_scope op. It may not be the same as op->getParentRegion().

And I am not sure why we need to find firstDeallocOp.

Another issue is that, if the alloca size is very small, I think we'd better keep using the alloca instead of a runtime function. In GC v1, I remember we had a size limit of ~128 bytes.

Contributor:

Another issue is that, if the alloca size is very small, I think we'd better keep using the alloca instead of a runtime function. In GC v1 I remember that we have a size limit at ~128 bytes.

This is what I am concerned about: the current pass is trying to convert all the allocas (no matter the size) into the runtime allocator, which is not that efficient.

Author:

Thanks for the suggestion.

  1. Used getParentWithTrait<OpTrait::AutomaticAllocationScope>() to locate the dealloc op's position.
  2. Used a while loop to find the firstDeallocOp, to ensure the deallocation order obeys FILO:

// Move the insertion point back if the operation before is a deallocOp
Operation *currentOp = builder.getInsertionPoint()->getPrevNode();
while (currentOp && isa<cpuruntime::DeallocOp>(currentOp)) {
  // Move the insertion point before the found deallocOp
  builder.setInsertionPoint(currentOp);
  currentOp = currentOp->getPrevNode();
}

  3. The size limit is added.

Reviewer:

need to use a while loop to find the firstDeallocOp to ensure the deallocation order obeys FILO.

Please double-check the IR result of promote-allocation-to-stack. AFAIK, there is no dealloc for alloca ops; they are "deallocated" at the end of the AutomaticAllocationScope.

Author:

Yes, from promote-allocation-to-stack through the end of the deallocation pipeline, no dealloc is created for alloca ops; they are deallocated automatically at the end of the AutomaticAllocationScope.

The above while loop only searches for cpuruntime::DeallocOp, to make sure the generated dealloc order is FILO.

Reviewer:

Why do we need to check for cpuruntime::DeallocOp ops that have just been created by us in this pass?

Author:

When we walk through the AutomaticAllocationScope with the IR below, the dealloc for the first allocated buffer m0 is created first, at the end of the scope. The dealloc for the next allocated buffer m1 must then be inserted as the previous node of dealloc %m0.

func.func @alloca_sequence() {
  // CHECK-LABEL: func @alloca_sequence()
  // CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xf32>
  // CHECK: %[[m1:.*]] = cpuruntime.alloc() : memref<128xf32>
  // CHECK: %[[m2:.*]] = cpuruntime.alloc() : memref<128xf32>
  %alloc = memref.alloca() : memref<128xf32>
  %alloc_0 = memref.alloca() : memref<128xf32>
  %alloc_1 = memref.alloca() : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m2]] : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m1]] : memref<128xf32>
  // CHECK: cpuruntime.dealloc %[[m0]] : memref<128xf32>
  return
}
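The walk-back described above can be modeled as inserting each new dealloc before any runtime deallocs already emitted at the scope end, which yields the reversed order in the CHECK lines (a Python sketch, not the pass itself):

```python
def insert_deallocs(allocas_in_order):
    """Model of the insertion-point walk-back: the dealloc for each
    buffer is placed before the deallocs already emitted, so the
    final order is the reverse of the allocation order (FILO)."""
    scope_tail = []  # ops at the end of the AutomaticAllocationScope
    for name in allocas_in_order:
        # Walking the insertion point back past the existing
        # cpuruntime.dealloc ops is equivalent to prepending here.
        scope_tail.insert(0, ("cpuruntime.dealloc", name))
    return scope_tail
```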

Author:

The pass now traverses the alloca ops first and then inserts the deallocOps in reverse order.

@crazydemo crazydemo requested a review from niuxiaog August 30, 2024 06:15
@crazydemo crazydemo force-pushed the zhangyan/enhance_allocator branch from 64e10ef to 6668da9 on August 30, 2024 06:23
@crazydemo (Author):

@ciyongch @Menooker Could you please help review this PR?

funcOp.walk([&](memref::AllocaOp allocaOp) {
  uint64_t allocSize =
      getMemRefSizeInBytes(allocaOp.getResult().getType());
  if (allocSize < 128) {
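The excerpted size check can be sketched as follows; the byte-size computation and the STACK_ALLOC_THRESHOLD name are assumptions based on this discussion, not the actual pass code:

```python
STACK_ALLOC_THRESHOLD = 128  # bytes; assumed cutoff from the discussion

ELEMENT_BYTES = {"f32": 4, "bf16": 2, "i8": 1}  # illustrative subset

def memref_size_in_bytes(shape, elem_type):
    """Static size of a memref, e.g. memref<128xf32> -> 512 bytes."""
    size = ELEMENT_BYTES[elem_type]
    for dim in shape:
        size *= dim
    return size

def use_runtime_allocator(shape, elem_type):
    """Small allocas stay as plain stack allocations; only buffers at
    or above the threshold go to the gc runtime allocator."""
    return memref_size_in_bytes(shape, elem_type) >= STACK_ALLOC_THRESHOLD
```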
Contributor:

Please use a meaningful named constant (something like STACK_ALLOC_THRESHOLD) for the number here.

Author:

Thanks for the advice. Fixed.

@crazydemo (Author):

Tests are updated.

@ciyongch ciyongch merged commit d72b1f1 into main Sep 4, 2024
6 checks passed
Successfully merging this pull request may close these issues.

  1. Correctness check fail
  2. Corrupt stack detected at runtime
  3. bf16 MLP failed with memory pool enabled