[Transform] Only use gc runtime allocator for stack-like alloca ops #287
Conversation
Compare 7de1cda to 71e675b
It would be good to add a simple test case covering this fallback path of the transform, instead of throwing an error.
} else if (isa<memref::DeallocOp>(op)) {
  // fallback: alloc / dealloc not in FILO fashion
  Value deallocMemref = op->getOperands().front();
  auto aliases = analysis.resolveReverse(deallocMemref);
Do we really need the aliases for analyzing deallocMemref vs AllocMemref? They'll be exactly the same, right?
According to createBufferDeallocationPass, deallocMemref will be the same as AllocMemref. But just in case we get an IR like the one below and call the memref-to-cpuruntime pass directly, do you think we can omit this possibility and not use alias analysis?
func.func @alloc_dealloc_view() {
// CHECK-LABEL: func @alloc_dealloc_view()
// CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xi8>
%c0 = arith.constant 0: index
%f0 = arith.constant 0.0: f32
%alloc = memref.alloc() : memref<128xi8>
%view = memref.view %alloc[%c0][] : memref<128xi8> to memref<32xf32>
%subview = memref.subview %view[0][16][1] : memref<32xf32> to memref<16xf32>
// CHECK: cpuruntime.dealloc
memref.dealloc %subview : memref<16xf32>
return
}
Why do we have such a case when it is not stack-like?
This case is from Issue#279. After the bufferDeallocation pass, we get the following IR, whose alloc / dealloc pairs are in FIFO fashion. This case should be resolved by PR#283 once the MergeAlloc pass is enabled, so the small temporary buffers will be merged into one larger buffer.
It is still better to check whether the alloc / dealloc order is stack-like, in case some complex scenarios still produce non-stack-like allocations.
scf.parallel (%arg11) = (%c0_3) to (%c32_4) step (%c16_5) {
%alloc_18 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xbf16>
%subview = memref.subview %arg2[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
%subview_19 = memref.subview %expand_shape[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
%3 = memref.load %subview[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
%4 = memref.load %subview_19[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
%5 = arith.extf %4 fastmath<contract> : bf16 to f32
%6 = arith.extf %3 fastmath<contract> : bf16 to f32
%7 = arith.addf %6, %5 : f32
%8 = arith.truncf %7 fastmath<contract> : f32 to bf16
memref.store %8, %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
scf.reduce
}
%alloc_20 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
%3 = memref.load %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
%4 = arith.extf %3 : bf16 to f32
memref.store %4, %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
scf.reduce
}
memref.dealloc %alloc_18 : memref<1x16x4096xbf16>
%alloc_21 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
%3 = memref.load %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
%4 = memref.load %0[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
%5 = math.powf %3, %4 : f32
memref.store %5, %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
scf.reduce
}
memref.dealloc %alloc_20 : memref<1x16x4096xf32>
%alloc_22 = memref.alloc() {alignment = 64 : i64} : memref<1x16xf32>
scf.parallel (%arg12, %arg13) = (%c0, %c0) to (%c1, %c16) step (%c1, %c1) {
memref.store %cst, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
scf.for %arg14 = %c0 to %c4096 step %c1 {
%3 = memref.load %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
%4 = memref.load %alloc_22[%arg12, %arg13] : memref<1x16xf32>
%5 = arith.addf %3, %4 : f32
memref.store %5, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
}
scf.reduce
}
But just in case we will have an IR as below, and directly call the memref-to-cpuruntime pass. Do you think we can omit this possibility, and do not use alias analysis?
I don't think it's valid to dealloc/free part of a buffer through either a view or a subview of a memref.alloc; please check the statement here: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td#L552-L555
Thanks for pointing out this reference. I will remove the alias.
Why do we have such a case when it is not stack-like?
Ideally, all the separated memref.alloc ops within the same scope should be merged into one big chunk by the memref-merge pass. This fallback can be a workaround before the memref-merge patch is pulled in, or in case there are non-FILO alloc/dealloc scenarios.
The alloc/free used to be FILO (stack-like). Is there any recent optimization upstream changing that?
Ideally, all the separated memref.alloc within the same scope should be merged into a big chunk by memref-merge pass,
The buffer-merge pass is an optimization and may be turned off at -O0 in the future. I suppose we should always make this pass safe to run.
If we are handling dynamically shaped buffers, buffer-merge won't work and we still need many small allocations. I would suggest that in our pass pipeline, if the stack-like allocator is on, we try to make sure BufferDeallocationPipeline maintains a FILO pattern; if the stack-like allocator is off, we can enable the non-FILO pattern in BufferDeallocationPipeline.
Added a test case.
@@ -96,6 +97,24 @@ struct ConvertMemRefToCPURuntime
      }
    }
  }
} else if (isa<memref::AllocOp>(op)) {
Please note that the walk() order may differ from the execution order. And what about the case below?
func AAA() {
a=alloc()
b=alloc()
c=alloc()
dealloc(b)
dealloc(a)
return c
}
The optimization still applies.
Another topic: since we are going to handle general MLIR programs that may be hand-written by users, can your pass handle this?
func AAA() {
a=alloc()
b=alloc()
c=alloc()
if(...) {
dealloc(c)
} else {
dealloc(b)
}
dealloc(a)
}
It looks stack-like, but it is not.
I would suggest stricter checking:
- First, check whether an alloc is "leaked" from the scope (already done in your code on the main branch).
- For the non-leaked allocations only, check whether the alloc/free order is stack-like within the same Region, ignoring the leaked allocations. Usually, we can ensure the execution order within the same region.
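The stricter check suggested above can be simulated outside MLIR. Below is a minimal Python sketch (the names and event encoding are illustrative, not the actual pass code) that replays an alloc/dealloc event sequence against a stack, skipping the leaked buffers:

```python
def is_stack_like(events, leaked):
    """events: ("alloc", buf) / ("dealloc", buf) pairs in program order.
    leaked: buffers that escape the scope; they are exempt from the check."""
    stack = []
    for kind, buf in events:
        if buf in leaked:
            continue  # leaked allocations are not required to be deallocated here
        if kind == "alloc":
            stack.append(buf)
        elif not stack or stack.pop() != buf:
            # a dealloc must free the most recently allocated live buffer
            return False
    return not stack  # every non-leaked alloc must be freed in this region

# The "return c" example: c leaks, a and b are freed in FILO order -> stack-like.
print(is_stack_like([("alloc", "a"), ("alloc", "b"), ("alloc", "c"),
                     ("dealloc", "b"), ("dealloc", "a")], leaked={"c"}))  # True

# The if/else example flattened along the else-path: dealloc(b) frees b
# while c is still live -> not stack-like.
print(is_stack_like([("alloc", "a"), ("alloc", "b"), ("alloc", "c"),
                     ("dealloc", "b"), ("dealloc", "a")], leaked=set()))  # False
```

This mirrors the two-step proposal: leaked allocations are filtered out first, and the remaining sequence must pop in exact reverse order of the pushes.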
Another topic is that, since we are going to handle general MLIR programs that may be hand-written by users, can your program handle this?
func AAA() { a=alloc() b=alloc() c=alloc() if(...) { dealloc(c) } else { dealloc(b) } dealloc(a) }
Not sure if it's a valid case or not, as it will always result in a memory leak?
Not sure if it's a valid case or not, as it will always result in memory leak?
I think we are targeting a general MLIR compiler, so any user input should be handled. Users can always write code like this. At least it will not produce a wrong result, if they can live with the memory leak.
This change tries to resolve the FILO allocation issue, but it also converts all the stack allocations (even small ones) into the runtime allocator, which may not be that efficient.
There are several ways to solve the alloc / dealloc style issue:
We can go with item 1, but it needs some enhancement to distinguish buffer sizes.
if (firstDeallocOp) {
  builder.setInsertionPoint(firstDeallocOp);
} else {
  // If no deallocOp was found, insert at the end of the region before
Please insert the deallocOp at the end of the parent alloca_scope op; it may not be the same as op->getParentRegion(). And I am not sure why we need to find firstDeallocOp?
Another issue is that if the alloca size is very small, we'd better keep using the alloca instead of a runtime function. In GC v1, I remember we had a size limit at ~128 bytes.
Another issue is that, if the alloca size is very small, I think we'd better keep using the alloca instead of a runtime function. In GC v1 I remember that we have a size limit at ~128 bytes.
This is what I am concerned about: the current pass tries to convert every alloca (regardless of size) to the runtime allocator, which is not that efficient.
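As a rough illustration of the size-based policy under discussion (the 128-byte threshold comes from the GC v1 figure mentioned above; the helper names are hypothetical, not the pass's real API), a Python sketch:

```python
STACK_ALLOC_THRESHOLD = 128  # bytes; ~128 B limit cited from GC v1

def memref_size_in_bytes(shape, elem_bits):
    """Size of a statically shaped memref: product of dims times element width."""
    size = elem_bits // 8
    for dim in shape:
        size *= dim
    return size

def use_runtime_allocator(shape, elem_bits):
    """True if the buffer is large enough to go through the runtime allocator;
    smaller buffers would stay as plain memref.alloca on the stack."""
    return memref_size_in_bytes(shape, elem_bits) >= STACK_ALLOC_THRESHOLD

print(use_runtime_allocator([128], 32))  # memref<128xf32> = 512 B -> True
print(use_runtime_allocator([16], 8))    # memref<16xi8>   = 16 B  -> False
```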
Thanks for the suggestion.
- Use getParentWithTrait<OpTrait::AutomaticAllocationScope>() to locate the dealloc op's position.
- Use a while loop to find the firstDeallocOp, to ensure the deallocation order obeys FILO:
// Move the insertion point back if the operation before is a deallocOp
Operation *currentOp = builder.getInsertionPoint()->getPrevNode();
while (currentOp && isa<cpuruntime::DeallocOp>(currentOp)) {
// Move the insertion point before the found deallocOp
builder.setInsertionPoint(currentOp);
currentOp = currentOp->getPrevNode();
}
- A size limit is added.
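The insertion-point walk in the C++ snippet above can be simulated in a few lines of Python (a sketch of the idea, not the pass itself): inserting each newly created dealloc before the deallocs already emitted at the end of the scope yields the FILO order.

```python
def emit_deallocs(alloca_names):
    """Simulate the insertion-point walk: each new dealloc is placed
    before the deallocs already emitted at the end of the scope."""
    tail = []
    for name in alloca_names:            # walk order == order of the allocas
        tail.insert(0, f"cpuruntime.dealloc %{name}")
    return tail

print(emit_deallocs(["m0", "m1", "m2"]))
# ['cpuruntime.dealloc %m2', 'cpuruntime.dealloc %m1', 'cpuruntime.dealloc %m0']
```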
need to use a while loop to find the firstDeallocOp to ensure the deallocation order obeys FILO.
Please double-check the IR result of promote-allocation-to-stack. AFAIK, there is no dealloc for alloca ops; they are "deallocated" at the end of the AutomaticAllocationScope.
Yes, from promote-allocation-to-stack to the end of the deallocation pipeline, no dealloc is created for alloca ops; they are deallocated automatically at the end of the AutomaticAllocationScope.
The while loop above only searches for cpuruntime::DeallocOp, to make sure the generated dealloc order is FILO.
Why do we need to check the cpuruntime::DeallocOp ops that were just created by us in this pass?
When we walk through the AutomaticAllocationScope with the IR below, the dealloc for the first allocated buffer m0 is created first, at the end of the scope. The dealloc for the next allocated buffer m1 must then be inserted just before dealloc %m0:
func.func @alloca_sequence() {
// CHECK-LABEL: func @alloca_sequence()
// CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xf32>
// CHECK: %[[m1:.*]] = cpuruntime.alloc() : memref<128xf32>
// CHECK: %[[m2:.*]] = cpuruntime.alloc() : memref<128xf32>
%alloc = memref.alloca() : memref<128xf32>
%alloc_0 = memref.alloca() : memref<128xf32>
%alloc_1 = memref.alloca() : memref<128xf32>
// CHECK: cpuruntime.dealloc %[[m2]] : memref<128xf32>
// CHECK: cpuruntime.dealloc %[[m1]] : memref<128xf32>
// CHECK: cpuruntime.dealloc %[[m0]] : memref<128xf32>
return
}
The pass now traverses the allocaOps first, and then inserts the deallocOps in reverse order.
Compare 64e10ef to 6668da9
funcOp.walk([&](memref::AllocaOp allocaOp) {
  uint64_t allocSize =
      getMemRefSizeInBytes(allocaOp.getResult().getType());
  if (allocSize < 128) {
Please use a meaningfully named variable (something like STACK_ALLOC_THRESHOLD) instead of the bare number here.
Thanks for the advice. Fixed.
Tests are updated.
Use createPromoteBuffersToStackPass to convert all legal stack-like allocations into alloca, and only apply the gc runtime allocator transformation on allocaOp; the corresponding dealloc is inserted.
Fixes Issue#279, Issue#278.