
Optimize runtime for non single-module (O1, O2) compilation #14225


Draft: kostya wants to merge 2 commits into master

Conversation

@kostya (Contributor) commented Jan 12, 2024

All these changes are questionable, and I don't even know whether they pass the specs. Also, the optimization results are from my own set of benchmarks, and I don't know whether they would hold for anything else.

Changes:

  1. Changed the behavior of AlwaysInline: it now does what the name suggests and inlines the code purely at the Crystal level. Before, it only set the attribute in LLVM, so inlining worked only within the current LLVM module, with no cross-module inlining. (Not sure whether backtraces still work, by the way.) So now all methods marked AlwaysInline behave like headers in C++. I also marked many small but critical methods with AlwaysInline, the ones responsible for most of the slowdown in per-module compilation. I only marked the methods that I found to be slow in my tests, but I am sure there are many more that I missed.
  2. Added a likely intrinsic, which should help with non-single-module optimizations.
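A minimal sketch of both changes together (clamped_index is a hypothetical method invented for illustration, and likely is the intrinsic this PR proposes, not something available in stock Crystal; see #11910):

```crystal
# With this PR, @[AlwaysInline] expands the method body at the call site
# during Crystal codegen, so the inlining crosses LLVM module boundaries
# even under parallel per-module -O2 compilation. Previously it only set
# the LLVM `alwaysinline` attribute, which cannot cross modules.
@[AlwaysInline]
def clamped_index(i : Int32, size : Int32) : Int32
  # `likely` (proposed by this PR) hints the optimizer that the branch
  # condition is usually true, improving block layout in hot loops.
  if likely(i < size)
    i
  else
    size - 1
  end
end
```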

Results:

With this PR, -O2 becomes a decent compilation level: the runtime is 0.9-2 times slower than --release (in one test it was even faster), while compile time is 3-4 times faster and incremental compile time is 5-10 times faster.
--release is only a little faster after all these optimizations, because single-module compilation and O3 do similar things with inlining.
Recompiling the compiler itself shows no visible effect from these optimizations.

Example of optimization from some benchmarks:

brainfuck
Master: -O2: initial: 2.40s, incremental: 1.33s, run time: 19.71s
This branch: -O2: initial: 2.25s, incremental: 1.32s, run time: 4.90s
The brainfuck runtime is 4 times faster in this branch, almost equal to the --release result of 4.26s (whose compile time is 6.74s).

@HertzDevil (Contributor):

There are multiple spec failures, so it looks like this has altered the semantics of some programs.

@straight-shoota (Member):

This is really good, awesome work 👍

I think we'll need to look at the individual actions independently, though.

There is already a pending discussion about likely/unlikely intrinsics: #11910. We should continue this part there. I think there's a general consensus on supporting it in some way, but the concrete implementation is still up for debate.

I don't think there has been any dedicated discussion about inlining methods on the Crystal side, except what you already mentioned in #13505. So this would be a new discussion to start. The idea definitely sounds interesting, but there is a wide range of possibilities for how it could be implemented.

Then I also see a number of other changes. I'm not sure what the reason is for adding parentheses to yield expressions?
Using wrapping math operators in places where overflow is impossible should be an easy win. I'm wondering if we could even do something about this in the compiler itself...

@straight-shoota straight-shoota marked this pull request as draft January 12, 2024 22:46
@kostya (Contributor, Author) commented Jan 12, 2024

The parentheses remain because I tried likely there, but somehow it produced a compile error, so I removed it.

Optimizations with the biggest impact:

  1. This line: https://github.com/crystal-lang/crystal/pull/14225/files#diff-160d80555792cf986b3f299649b59e7b973ac45a415cafe2978905ccfacce4d1R576. each_index is a method used in almost every loop, and removing the checked add generates better asm.
  2. All the inlines in src/pointer.cr: Pointer is used everywhere, so making a real call for it is huge overhead.
    https://github.com/crystal-lang/crystal/pull/14225/files#diff-f424630a978047aee2dd7194fe51fea5a69342ab1a9de74342b0382f01dc6211L40
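The checked-vs-wrapping difference can be shown in a standalone loop (a hypothetical example in the spirit of the change, not the actual each_index source; Crystal's &+ is the wrapping counterpart of the overflow-checked +):

```crystal
arr = [10, 20, 30]

sum = 0
i = 0
while i < arr.size
  sum &+= arr.unsafe_fetch(i) # no bounds check: i < size is proven above
  i &+= 1                     # wrapping increment: i is bounded by size,
                              # so overflow is impossible and the overflow
                              # check branch is pure cost
end
puts sum # => 60
```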

@zw963 (Contributor) commented Jan 13, 2024

Awesome work, @kostya! Although I don't understand this code, looking at the final result, it's great!

@zw963 (Contributor) commented Jan 13, 2024

In fact, I hope the core members can raise the priority of this meaningful discussion.

Don't let it be like #13464: it took eight months from proposal to merge, which is too long. That destroys the enthusiasm of open-source contributors, and it also destroys the enthusiasm of people using Crystal. In fact, I feel that without the hard work of @kostya and @funny-falcon's follow-up, #13464 would probably never have been merged.

Without the help of open-source contributors, it would be difficult for Crystal to make great progress. Just like the release of 1.11.0: many issues were discovered and fixed very quickly in 1.11.1 👍

@funny-falcon (Contributor):

Shouldn't each_with_index be specialized for common collections, then? At least Array already has an index while it iterates; there is no need for another variable.

I see a problem with the untyped offset parameter of each_with_index and wrapping addition: what if the user passes a UInt8 as the offset? What if we iterate external storage, so its index could be larger than 2G?
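The concern can be demonstrated in isolation (a standalone illustration, not code from the PR): with a small unsigned offset, a wrapping add silently wraps instead of raising, so an offset-based index would become wrong rather than fail loudly.

```crystal
offset = 250_u8

# Checked addition raises once the result exceeds UInt8::MAX (255):
#   offset + 10_u8  # => raises OverflowError

# Wrapping addition silently wraps modulo 256, so an each_with_index built
# on &+ would hand the block a wrapped index instead of an error:
puts offset &+ 10_u8 # => 4
```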

@kostya (Contributor, Author) commented Jan 13, 2024

Instead of AlwaysInline, we could, for example, somehow compute a method's complexity and inline it if it is small, say fewer than 5 instructions (even with all calls unfolded), has no loops, and is not self-recursive. But I don't see how to do that in the current compiler, because it generates LLVM directly without an intermediate form.

@kostya (Contributor, Author) commented Jan 13, 2024

Another idea: mark a whole class with AlwaysInline, like Pointer, and automatically inline all calls to it.

@kostya (Contributor, Author) commented Jan 13, 2024

There is already some small inlining in call codegen (https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/codegen/call.cr#L449), but it only inlines primitives, self, and getters. It should also inline setters, for example.
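For illustration, here is the kind of trivial accessor pair involved (Counter is a hypothetical class; the point is that the generated getter is a bare ivar load and the generated setter a bare ivar store, both one-instruction bodies):

```crystal
class Counter
  # `getter` expands to `def value; @value; end`: the codegen linked above
  # already inlines calls like this.
  getter value : Int32 = 0

  # `setter` expands to `def value=(@value); end`: a single instance-variable
  # store, so it is an equally good inlining candidate.
  setter value
end

c = Counter.new
c.value = 41
puts c.value # => 41
```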

@kostya (Contributor, Author) commented Jan 16, 2024

I'm experiencing very slow compilation of std_spec on this branch with -O2. I think this is because all the specs get compiled into a single module (since there are no classes), so parallel compilation provides no benefit here.

@ysbaddaden (Contributor):

@kostya indeed, the _main.bc file for make std_spec is 29MB in crystal's cache.

@ysbaddaden (Contributor):

  • praise: some very interesting investigation here! Finally some numbers to back some concepts: likely/unlikely + inlining directly instead of merely hinting LLVM 💯

  • praise: auto-inlining low-complexity methods sounds like a great idea (still need to check for @[NoInline]). Please open issues so we can talk about these (inline in crystal + auto-inline)!

  • suggestion: it would be even more interesting to know which types of optimization have the most impact (and how much) 👀

  • suggestion: if adding @[AlwaysInline] to all methods in Pointer(T) and Intrinsics has such a positive impact, maybe you can open a pull request with just that? They should be called often enough that LLVM will always inline them.

  • issue: as outlined by @funny-falcon, the removed overflow checks in Enumerable can lead to overlooked overflows (e.g. Enumerable doesn't have to be finite); we can, however, consider overriding the methods in objects with a finite number of iterations 👍

@kostya (Contributor, Author) commented Jan 24, 2024

To simplify testing, I added this repo: https://github.com/kostya/crystal-metric

--release: 25.6282s

Master:
-O2: 106.0206s

This branch:
-O2: 30.3626s

Master + only inline pointer:
-O2: 65.6595s

In this branch the -O2 runtime is close to --release, but compilation is much faster, and it is also parallel and incremental.

@kostya (Contributor, Author) commented Jan 25, 2024

My first branch was separated into commits. These are the crystal-metric results for each commit, at -O2:

  1. Initial, kostya@8b5fbba, 105.7559s
  2. Codegen always inline: kostya@675669e, 104.1533s
  3. kostya@43514d7, 70.3383s
  4. kostya@5d90685, 70.0416s
  5. kostya@9e8d792, 59.5174s
  6. kostya@1d212fb, 43.9694s
  7. kostya@15933d3, 31.0405s
  8. Likely, kostya@f3f3c40, 31.0076s
  9. kostya@66afa46, 30.6503s
  10. kostya@059ec72, 29.7027s
  11. kostya@3222e28, 30.3150s
  12. kostya@b5f5902, 30.2319s
  13. kostya@9044866, 30.3091s
  14. kostya@2bd9340, 29.7686s
  15. kostya@3ad621f, 30.2483s
  16. kostya@93097f7, 30.0238s
  17. Auto inline setters, kostya@4291b7b, 29.7598s

@ysbaddaden (Contributor):

Very nice! So, these 4 commits had the most impact:

kostya@43514d7, 70.3383s
kostya@9e8d792, 59.5174s
kostya@1d212fb, 43.9694s
kostya@15933d3, 31.0405s

Looking at them, it's mostly inlining methods, some (un)likely, and a few removed overflow checks in Enumerable. I believe what @funny-falcon and I suggested above still stands?

@kostya (Contributor, Author) commented Jan 26, 2024

I see one more thing to improve, which I don't know how to do: record constructors.

record Bla, x : Int32
Bla.new 1

Bla.new generates a call and slows down many benchmarks, but if I just replace the record with a Tuple, the benchmarks become much faster.
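For context, record is a macro that expands to roughly the following struct (a simplified sketch; the real expansion also adds conveniences like copy_with and to_s). Since a struct is a value type, Bla.new should in principle be just a stack write, yet under per-module -O2 the constructor is emitted as a real call:

```crystal
# `record Bla, x : Int32` expands to approximately:
struct Bla
  getter x : Int32

  def initialize(@x : Int32)
  end
end

puts Bla.new(1).x # => 1
```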

@kostya (Contributor, Author) commented Jan 29, 2024

Plotting these changes for the crystal-metric repo. 1.11.2-my is 1.11.2 recompiled on my MacBook M1 in release mode; the other releases were downloaded from GitHub. Funny that it has a faster compile time, maybe because of Rosetta, or maybe because of #12060.

[plot: crystal-metric results across releases]

7 participants