Make LLVM's stack growable #644
Conversation
The hacky script I've used to generate the benchmark data: https://gist.github.com/marvinborner/f71e3fdf55548692c7eb7b8d348d4284
libraries/llvm/rts.ll
%size = sub i64 %intStackPointer, %intBase
%double = mul i64 %size, 2
%newSize = add i64 %double, %n ; TODO: should we be smarter here?
As someone who doesn't understand the LLVM runtime, why do we need to add the %n here?
It's not strictly related to the runtime, but rather to the growing logic. Since the number of bytes to be allocated (n) could be larger than double the current stack size, I first double the size and then add n to it. We could (should?) of course do something smarter here.
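For intuition, here's a minimal C model of this rule (the function name is made up for illustration):

#include <stdint.h>

/* Model of the growth rule in the diff above: double the current size,
   then add the requested byte count n, so the new stack always fits
   the allocation even when n exceeds double the current size. */
uint64_t grow_double_plus_n(uint64_t size, uint64_t n) {
    return size * 2 + n;
}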
Intuitively, I'd expect something like "if we know we have to resize, then new_size <- nextStrictlyBiggestPowerOfTwo(current_size, n)", where the mysterious function is https://llvm.org/doxygen/namespacellvm.html#afb65eef479f0473d0fe1666b80155237 or clz (get the highest bit, choose a number one bigger in binary), but I have to think about it.
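A sketch of that rounding trick in C, assuming the linked function is llvm::NextPowerOf2 (which, as I understand it, returns the next power of two strictly greater than its argument):

#include <stdint.h>

/* Smallest power of two >= x: for x > 1 this is 1 << (64 - clz(x - 1)). */
uint64_t round_up_pow2(uint64_t x) {
    return x <= 1 ? 1 : 1ull << (64 - __builtin_clzll(x - 1));
}

/* "Strictly bigger" variant: the next power of two strictly greater than
   both the current size and the requested bytes n
   (overflow for sizes >= 2^63 is ignored in this sketch). */
uint64_t next_strictly_bigger_pow2(uint64_t current_size, uint64_t n) {
    uint64_t lower = current_size > n ? current_size : n;
    return lower == 0 ? 1 : 1ull << (64 - __builtin_clzll(lower));
}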
Yes, that also sounds good. My thought behind the strategy was that, intuitively, large stack allocations are quite rare (or, in Effekt's case, impossible?). So hypothetically, if I allocate 1GB on a 10MB stack, I'd prefer the next stack size to be 20MB + 1GB rather than 2GB.
In hindsight this hypothetical doesn't really make sense, because we only use this function to allocate really small sizes 🤷‍♂️
Might be worth looking at
https://github.com/golang/go/blob/24cb743d1faa6c8f612faa3c17ac9de5cc385832/src/runtime/stack.go#L336
and its call sites.
I thought about it for a bit and something like the following should be pretty good:
new_size := nextPowerOf2(max(size * 2, size + n + 64))
Rationale:
- aligning to the next power of two is Good ®️
- if the allocation is huge compared to the current size, it's best to keep at least a little bit (64 bytes) of extra space so that we don't have to allocate again soon
Here's my crappy LLVM impl:
; Calculate double of current size
%double_size = shl i64 %size, 1
; Calculate size + n + 64 (small buffer)
%size_plus_n = add i64 %size, %n
%size_plus_n_buffer = add i64 %size_plus_n, 64
; Take the maximum of (size * 2) and (size + n + 64)
%max_size = call i64 @llvm.umax.i64(i64 %double_size, i64 %size_plus_n_buffer)
; Round up to the next power of 2 using ctlz:
; the smallest power of two >= x is 1 << (64 - ctlz(x - 1))
%max_minus_one = sub i64 %max_size, 1
%leading_zeros = call i64 @llvm.ctlz.i64(i64 %max_minus_one, i1 false)
%shift_amount = sub i64 64, %leading_zeros
%newSize = shl i64 1, %shift_amount
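For sanity-checking, here's the same computation as a small C model (my own sketch, not part of the PR):

#include <stdint.h>

/* new_size = nextPowerOf2(max(size * 2, size + n + 64)) */
uint64_t new_stack_size(uint64_t size, uint64_t n) {
    uint64_t max_size = size * 2 > size + n + 64 ? size * 2 : size + n + 64;
    /* smallest power of two >= max_size (max_size > 1 here) */
    return 1ull << (64 - __builtin_clzll(max_size - 1));
}

For example, new_stack_size(1024, 24) yields 2048, and new_stack_size(1024, 4096) yields 8192.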
Of course, perhaps I'm really overthinking this, I'd really need to benchmark.
Also, I'm not sure whether we should do size * 2, size * 1.5, or size * <golden ratio>. Probably again needs to be benchmarked.
To continue thinking about this, it would be really nice to know the "profile" of allocations: what do our allocations actually look like? Can we serialise this somehow and read it back later?
(see summary below where I discover that I have no clue how to model this)
WDYT?
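One way to get such a profile would be a counter in the runtime's allocation path. A minimal sketch; the hook point and the names record_alloc / dump_alloc_profile are made up, and I'm assuming sizes are multiples of 8 (as the data below suggests):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical instrumentation: count allocation sizes in a histogram
   (bucket i holds allocations of (i + 1) * 8 bytes) and dump it as CSV
   at exit so it can be read back and analysed later. */
#define BUCKETS 64
static uint64_t alloc_histogram[BUCKETS];

void record_alloc(uint64_t n) {
    uint64_t bucket = n / 8 - 1;   /* assumes n is a multiple of 8 */
    if (bucket < BUCKETS) alloc_histogram[bucket]++;
}

void dump_alloc_profile(void) {
    FILE *f = fopen("alloc_profile.csv", "w");
    if (!f) return;
    for (int i = 0; i < BUCKETS; i++)
        if (alloc_histogram[i])
            fprintf(f, "%d,%llu\n", (i + 1) * 8,
                    (unsigned long long)alloc_histogram[i]);
    fclose(f);
}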
btw, @jiribenes tried to do statistics here
I also thought about my assumptions:
- the number of allocations follows a Pareto distribution
- the size of an allocation is a power of two and at least 64, and follows a Pareto distribution
but when I even try to simulate them, I can clearly see that they are not true, since under these distributions the function I suggested produces very whacky stacks:
Initial size | Trimmed mean size | Median size | 90th percentile size | Max size |
---|---|---|---|---|
1024 bytes | 437134839.21 bytes | 524288.00 bytes | 1073741824.00 bytes | 1073741824.00 bytes |
4096 bytes | 452582561.59 bytes | 2097152.00 bytes | 1073741824.00 bytes | 1073741824.00 bytes |
8192 bytes | 461104562.45 bytes | 4194304.00 bytes | 1073741824.00 bytes | 1073741824.00 bytes |
Of course, my code is most likely bad, but this is not very encouraging.
Very interesting, although I agree that maybe you're overthinking this a bit.
Regarding the profile of the allocations: we (currently?) only use @stackAllocate to make space for pushes of pos/neg, builtin types and other pointers, never for arbitrarily large data allocations. (Maybe @phischu can confirm this?)
As far as I understand, large (de-)allocations only happen when some function has a lot of arguments that need to be stored on the stack (~8-16 bytes each). In nqueens, for example, this leads to stack (de-)allocations of 108 bytes.
To make this more concrete, here are the allocation sizes and counts across all generated files in effekt.llvmtests:
bytes allocated | how often |
---|---|
8 | 318 |
16 | 305 |
24 | 1856 |
32 | 101 |
40 | 57 |
48 | 22 |
56 | 24 |
64 | 4 |
72 | 1 |
80 | 1 |
So I don't think we really need to be creative with the growing logic, since %n will almost never be larger than the doubled current stack size, especially if we settle on an initial stack size like 1024. Minimal solutions like adding the current size (as I did), or using NextPowerOf2, are probably more than enough.
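To illustrate the difference between the two minimal policies, a small sketch (using the dominant 24-byte allocation size from the table above):

#include <stdint.h>
#include <stdio.h>

/* Compare the capacity sequences of the two minimal policies, starting
   from a 1 KB stack: "double + n" drifts off power-of-two alignment
   (1024, 2072, 4168, ...), while plain doubling stays on it. */
int main(void) {
    uint64_t a = 1024, b = 1024, n = 24;
    for (int step = 0; step < 5; step++) {
        printf("double+n: %8llu   doubling: %8llu\n",
               (unsigned long long)a, (unsigned long long)b);
        a = a * 2 + n;   /* the rule from this PR */
        b = b * 2;       /* NextPowerOf2-style doubling */
    }
    return 0;
}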
@phischu asked about the stack allocation profile at runtime. Here's the data:
For the entire test suite (test):
bytes allocated | how often | percentage |
---|---|---|
8 | 2918774 | 28.8 |
16 | 815964 | 8 |
24 | 5386483 | 53.1 |
32 | 841972 | 8.3 |
40 | 169619 | 1.7 |
48 | 6444 | 0.06 |
56 | 79 | <0.01 |
64 | 2085 | 0.02 |
72 | 126 | <0.01 |
80 | 707 | <0.01 |
96 | 8 | <0.01 |
112 | 1 | <0.01 |
Only the LLVM tests (testOnly effekt.LLVMTests):
bytes allocated | how often | percentage |
---|---|---|
8 | 2918721 | 28.8 |
16 | 815664 | 8 |
24 | 5385418 | 53.1 |
32 | 841852 | 8.3 |
40 | 169604 | 1.7 |
48 | 6435 | 0.06 |
56 | 79 | <0.01 |
64 | 2080 | 0.02 |
72 | 126 | <0.01 |
80 | 702 | <0.01 |
Thanks for the benchmarks, do I read it correctly that …
Most continuation-heavy benchmarks seem to be much faster, except parsing dollars. That's strange.
Thanks @marvinborner!
This enables automatic growing of LLVM's stack once its end is reached (or when an allocation is larger than the remaining space).
We can now have much smaller stack sizes than before. For now, I've reduced the previous 256M stacks to 1KB (which, after initial tests, seems fine).
Due to some code repetition, I wanted to merge the growing logic of regions with this. I've reverted my attempts since #642 removes the duplicated code anyway.