Make LLVM's stack growable #644

Merged
merged 6 commits into master from feature/llvm-stack-growing
Oct 23, 2024
Conversation

marvinborner
Member

This enables automatic growing of LLVM's stack once its end is reached (or when an allocation is larger than the remaining space).

We can now have much smaller stack sizes than before. For now, I've set the previous 256M stacks to 1KB (which, after initial tests, seems fine).

Due to some code repetition I wanted to merge the growing logic of regions with this. I've reverted my attempts since #642 removes the duplicated code anyway.
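
In pseudocode terms, the check this adds looks roughly like the following minimal C sketch (grow_stack and stack_allocate are illustrative names, not the actual rts.ll symbols; the growth policy shown, double the used size and add the request, follows the snippet discussed below):

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative stack representation; the real layout lives in libraries/llvm/rts.ll. */
    typedef struct {
        uint8_t *base;   /* start of the backing buffer             */
        uint8_t *sp;     /* current stack pointer (bump allocation) */
        uint8_t *limit;  /* end of the backing buffer               */
    } Stack;

    /* Grow the backing buffer so that at least n more bytes fit. */
    static void grow_stack(Stack *s, uint64_t n) {
        uint64_t used    = (uint64_t)(s->sp - s->base);
        uint64_t newSize = used * 2 + n;
        uint8_t *newBase = realloc(s->base, newSize);   /* error handling elided */
        s->base  = newBase;
        s->sp    = newBase + used;
        s->limit = newBase + newSize;
    }

    /* Allocate n bytes, growing the stack once its end is reached. */
    static void *stack_allocate(Stack *s, uint64_t n) {
        if ((uint64_t)(s->limit - s->sp) < n)
            grow_stack(s, n);
        void *result = s->sp;
        s->sp += n;
        return result;
    }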

libraries/llvm/rts.ll (outdated review thread, resolved)
@marvinborner
Member Author

marvinborner commented Oct 16, 2024

[Figure_1: benchmark comparison plot]

The benchmarks use config_llvm.txt similar to effekt-plots. Current master is in the left bar. I can't really explain the sudden change near the end (and in master), but the jumps are fully reproducible.

The hacky script I've used to generate the data: https://gist.github.com/marvinborner/f71e3fdf55548692c7eb7b8d348d4284


%size = sub i64 %intStackPointer, %intBase
%double = mul i64 %size, 2
%newSize = add i64 %double, %n ; TODO: should we be smarter here?
Contributor

As someone who doesn't understand the LLVM runtime, why do we need to add the %n here?

Member Author

It's not strictly related to the runtime, but rather to the growing logic. Since the number of bytes to be allocated (n) could be larger than double the current stack size, I first double the size and then add n to it. We could (should?) of course do something smarter here.

Contributor

@jiribenes jiribenes Oct 16, 2024

Intuitively, I'd expect something like "if we know we have to resize, then new_size <- nextStrictlyBiggestPowerOfTwo(current_size, n)" where the mysterious function is https://llvm.org/doxygen/namespacellvm.html#afb65eef479f0473d0fe1666b80155237 or clz (get the highest bit, choose a number one bigger in binary), but I have to think about it.
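
For illustration, the clz trick as a short, hedged C sketch (a single-argument rounding helper; __builtin_clzll is the GCC/Clang intrinsic, and this is not necessarily what LLVM's NextPowerOf2 does):

    #include <stdint.h>

    /* Smallest power of two strictly greater than x (valid for 0 < x < 2^63):
       take the position of the highest set bit and shift one past it. */
    static uint64_t next_strictly_bigger_power_of_two(uint64_t x) {
        return 1ULL << (64 - __builtin_clzll(x));
    }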

Member Author

@marvinborner marvinborner Oct 16, 2024

Yes, that also sounds good. My thought behind the strategy was that, intuitively, large stack allocations are quite rare (or, in effekt's case, impossible?). So hypothetically, if I allocate 1GB on a 10MB stack, I'd prefer the next stack size to be 20MB + 1GB rather than 2GB.

In hindsight this hypothetical doesn't really make sense because we only use this function to allocate really small sizes 🤷‍♂️

Contributor

@jiribenes jiribenes Oct 18, 2024

I thought about it for a bit and something like the following should be pretty good:

new_size := nextPowerOf2(max(size * 2, size + n + 64))

Rationale:

  • aligning to next power of two is Good ®️
  • if the allocation is huge compared to the current size, it's best to keep at least a little bit (64 bytes) of extra space so that we don't have to allocate again soon

Here's my crappy LLVM impl:

    ; Calculate double of current size
    %double_size = shl i64 %size, 1
    
    ; Calculate size + n + 64 (small buffer)
    %size_plus_n = add i64 %size, %n
    %size_plus_n_buffer = add i64 %size_plus_n, 64
    
    ; Take the maximum of (size * 2) and (size + n + 64)
    ; (llvm.umax is the integer max intrinsic; llvm.maximum is for floats)
    %max_size = call i64 @llvm.umax.i64(i64 %double_size, i64 %size_plus_n_buffer)
    
    ; Round up to the next power of 2 using ctlz:
    ; %floor_pow2 is the largest power of two <= %max_size
    %leading_zeros = call i64 @llvm.ctlz.i64(i64 %max_size, i1 false)
    %shift_amount = sub i64 63, %leading_zeros
    %floor_pow2 = shl i64 1, %shift_amount
    ; keep %max_size if it already is a power of two, otherwise take the next one up
    %is_pow2 = icmp eq i64 %floor_pow2, %max_size
    %ceil_pow2 = shl i64 %floor_pow2, 1
    %newSize = select i1 %is_pow2, i64 %max_size, i64 %ceil_pow2

Of course, perhaps I'm really overthinking this; I'd really need to benchmark.
Also, I'm not sure whether we should do size * 2, size * 1.5 or size * <golden ratio>. Probably again needs to be benchmarked.

To continue thinking about this, it would be really nice to know the "profile" of allocations: what do our allocations actually look like? Can we serialise this somehow and read it later?
(see summary below where I discover that I have no clue how to model this)

WDYT?


btw, @jiribenes tried to do statistics here

I also thought about my assumptions:

  • the number of allocations follows a Pareto distribution
  • the size of an allocation is a power of two and at least 64, and follows a Pareto distribution

but when I even try to simulate them, I can clearly see that they are not true, since they result in very whacky stacks with the function I suggested under the distributions above:

Initial size: 1024 bytes
  Trimmed Mean size: 437134839.21 bytes
  Median size: 524288.00 bytes
  90th percentile size: 1073741824.00 bytes
  Max size: 1073741824.00 bytes

Initial size: 4096 bytes
  Trimmed Mean size: 452582561.59 bytes
  Median size: 2097152.00 bytes
  90th percentile size: 1073741824.00 bytes
  Max size: 1073741824.00 bytes

Initial size: 8192 bytes
  Trimmed Mean size: 461104562.45 bytes
  Median size: 4194304.00 bytes
  90th percentile size: 1073741824.00 bytes
  Max size: 1073741824.00 bytes

Of course, my code is most likely bad, but this is not very encouraging.
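
(For reference, one way such a simulation could be set up is sketched below in C; the actual script is not included here, so the Pareto parameters, the 1 GiB cap, and the loop structure are all guesses.)

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Inverse-transform sample from a Pareto distribution with scale x_m and shape alpha. */
    static double pareto(double x_m, double alpha) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0, 1) */
        return x_m / pow(u, 1.0 / alpha);
    }

    /* Smallest power of two >= x. */
    static uint64_t next_pow2(uint64_t x) {
        uint64_t p = 1;
        while (p < x) p <<= 1;
        return p;
    }

    /* Growth function suggested above: nextPowerOf2(max(size * 2, size + n + 64)). */
    static uint64_t grow(uint64_t size, uint64_t n) {
        uint64_t a = size * 2, b = size + n + 64;
        return next_pow2(a > b ? a : b);
    }

    int main(void) {
        const uint64_t cap = 1ULL << 30;              /* assumed 1 GiB cap on allocation sizes */
        uint64_t size = 1024, used = 0;               /* initial stack size under test */
        long allocations = (long)pareto(100.0, 1.5);  /* assumption 1: #allocations ~ Pareto */

        for (long i = 0; i < allocations; i++) {
            /* assumption 2: allocation size is a power of two, at least 64, ~ Pareto */
            uint64_t n = next_pow2((uint64_t)pareto(64.0, 1.0));
            if (n > cap) n = cap;
            if (used + n > size) size = grow(size, n);
            used += n;
        }
        printf("final stack size: %llu bytes\n", (unsigned long long)size);
        return 0;
    }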

Member Author

Very interesting, although I agree that maybe you're overthinking this a bit :octocat:

Regarding the profile of the allocations: We (currently?) only use @stackAllocate to make space for pushes of pos/neg, builtin types and other pointers, never for arbitrarily large data allocations. (Maybe @phischu can confirm this?)

As far as I understand, large (de-)allocations only happen when some function has a lot of arguments that need to be stored on the stack (~8-16 bytes each). In nqueens, for example, this leads to stack (de-)allocations of 108 bytes.

To make this clearer, here are the allocation sizes/counts across all generated files in effekt.LLVMTests:

bytes allocated    how often
8                  318
16                 305
24                 1856
32                 101
40                 57
48                 22
56                 24
64                 4
72                 1
80                 1

So I don't think we really need to be creative with the growing logic, since %n will almost never be larger than double the current stack size, especially if we settle on an initial stack size like 1024. Minimal solutions like adding the current size (as I did) or using NextPowerOf2 are probably more than enough.

Member Author

@marvinborner marvinborner Oct 22, 2024

@phischu asked about the stack allocation profile at runtime. Here's the data:

For the entire test suite (test):

bytes allocated    how often    percentage
8                  2918774      28.8
16                 815964       8
24                 5386483      53.1
32                 841972       8.3
40                 169619       1.7
48                 6444         0.06
56                 79           <0.01
64                 2085         0.02
72                 126          <0.01
80                 707          <0.01
96                 8            <0.01
112                1            <0.01

Only the LLVM tests (testOnly effekt.LLVMTests):

bytes allocated    how often    percentage
8                  2918721      28.8
16                 815664       8
24                 5385418      53.1
32                 841852       8.3
40                 169604       1.7
48                 6435         0.06
56                 79           <0.01
64                 2080         0.02
72                 126          <0.01
80                 702          <0.01

@b-studios
Collaborator

Thanks for the benchmarks! Do I read it correctly that triples and tree_explore basically disappear because they are much faster?

@b-studios
Collaborator

Most continuation-heavy benchmarks seem to be much faster, except parsing dollars. That's strange.

@b-studios b-studios requested review from b-studios, phischu and abgruszecki and removed request for abgruszecki and b-studios October 22, 2024 15:04
libraries/llvm/rts.ll (outdated review thread, resolved)
@b-studios b-studios merged commit cb2f439 into master Oct 23, 2024
2 checks passed
@b-studios b-studios deleted the feature/llvm-stack-growing branch October 23, 2024 14:47
@b-studios
Collaborator

Thanks @marvinborner !
