
Optimize runtime for non single-module (O1, O2) compilation #14225


Draft: kostya wants to merge 2 commits into master

Conversation

@kostya (Contributor) commented Jan 12, 2024

All these changes are questionable, and I don't even know whether they pass the specs. Also, the optimization results are from my own set of benchmarks, and I don't know whether they would hold for anything else.

Changes:

  1. Changed the behavior of AlwaysInline: it now does what the name suggests and inlines the code purely at the Crystal level. Before, it only set the attribute in LLVM, so inlining worked only within the current LLVM module, with no cross-module inlining. (Not sure whether backtraces still work, by the way.) So now all methods marked AlwaysInline behave like headers in C++. I also marked many small but critical methods with AlwaysInline, the ones responsible for most of the slowdown in per-module compilation. I only marked the methods that I found to be slow in my tests, but I am sure there are many more that I missed.
  2. Added a likely intrinsic, which should help with non-single-module optimizations.
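A minimal sketch of both changes together (clamped_index is a hypothetical method invented for illustration, and likely is the intrinsic this PR proposes, not something available in stock Crystal; see #11910):

```crystal
# With this PR, @[AlwaysInline] expands the method body at the call site
# during Crystal codegen, so the inlining crosses LLVM module boundaries
# even under parallel per-module -O2 compilation. Previously it only set
# the LLVM `alwaysinline` attribute, which cannot cross modules.
@[AlwaysInline]
def clamped_index(i : Int32, size : Int32) : Int32
  # `likely` (proposed by this PR) hints the optimizer that the branch
  # condition is usually true, improving block layout in hot loops.
  if likely(i < size)
    i
  else
    size - 1
  end
end
```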

Results:

With this PR, -O2 becomes a decent compilation level: the runtime is 0.9-2 times slower than --release (in one test it was even faster), while compile time is 3-4 times faster and incremental compile time is 5-10 times faster.
--release is only a little faster after all these optimizations, because single-module compilation and O3 do similar things with inlining.
Recompiling the compiler itself shows no visible effect from these optimizations.

Example of optimization from some benchmarks:

brainfuck
Master: -O2: initial: 2.40s, incremental: 1.33s, run time: 19.71s
This branch: -O2: initial: 2.25s, incremental: 1.32s, run time: 4.90s
The brainfuck runtime is 4 times faster in this branch, almost equal to the --release result of 4.26s (whose compile time is 6.74s).

@HertzDevil (Contributor):

There are multiple spec failures, so it looks like this has altered the semantics of some programs.

@straight-shoota (Member):

This is really good, awesome work 👍

I think we'll need to look at the individual actions independently, though.

There is already a pending discussion about likely/unlikely intrinsics: #11910. We should continue this part there. I think there's a general consensus on supporting it in some way, but the concrete implementation is still up for debate.

I don't think there has been any dedicated discussion about inlining methods on the Crystal side, except what you already mentioned in #13505. So this would be a new discussion to start. The idea definitely sounds interesting, but there is a wide range of possibilities for how it could be implemented.

Then I also see a number of other changes. I'm not sure what the reason is for adding parentheses to yield expressions?
Using wrapping math operators in places where overflow is impossible should be an easy win. I'm wondering if we could even do something about this in the compiler itself...

@straight-shoota straight-shoota marked this pull request as draft January 12, 2024 22:46
@kostya (Contributor, Author) commented Jan 12, 2024

The parentheses remain because I tried likely there, but somehow it produced a compile error, so I removed it.

Optimizations with the biggest impact:

  1. This line: https://github.com/crystal-lang/crystal/pull/14225/files#diff-160d80555792cf986b3f299649b59e7b973ac45a415cafe2978905ccfacce4d1R576. each_index is a method used in almost every loop, and removing the checked add generates better asm.
  2. All the inlines in src/pointer.cr: Pointer is used everywhere, so making a real call for it is huge overhead.
    https://github.com/crystal-lang/crystal/pull/14225/files#diff-f424630a978047aee2dd7194fe51fea5a69342ab1a9de74342b0382f01dc6211L40
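The checked-vs-wrapping difference can be shown in a standalone loop (a hypothetical example in the spirit of the change, not the actual each_index source; Crystal's &+ is the wrapping counterpart of the overflow-checked +):

```crystal
arr = [10, 20, 30]

sum = 0
i = 0
while i < arr.size
  sum &+= arr.unsafe_fetch(i) # no bounds check: i < size is proven above
  i &+= 1                     # wrapping increment: i is bounded by size,
                              # so overflow is impossible and the overflow
                              # check branch is pure cost
end
puts sum # => 60
```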

@zw963 (Contributor) commented Jan 13, 2024

Awesome work, @kostya! Although I don't understand this code, looking at the final result, it's great!

@zw963 (Contributor) commented Jan 13, 2024

In fact, I hope the core members can raise the priority of this meaningful discussion.

Don't let it be like #13464: it took eight months from proposal to merge, which is too long. That destroys the enthusiasm of open-source contributors, and it also destroys the enthusiasm of people using Crystal. In fact, I feel that without the hard work of @kostya and @funny-falcon's follow-up, #13464 would probably never have been merged.

Without the help of open-source contributors, it would be difficult for Crystal to make great progress. Just like the release of 1.11.0: many issues were discovered and fixed very quickly in 1.11.1 👍

@funny-falcon (Contributor):

Shouldn't each_with_index be specialized for common collections, then? At least Array already has an index while it iterates; there is no need for another variable.

I see a problem with the untyped offset parameter of each_with_index and wrapping addition: what if the user passes a UInt8 as the offset? What if we iterate external storage, so its index could be larger than 2G?
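The concern can be demonstrated in isolation (a standalone illustration, not code from the PR): with a small unsigned offset, a wrapping add silently wraps instead of raising, so an offset-based index would become wrong rather than fail loudly.

```crystal
offset = 250_u8

# Checked addition raises once the result exceeds UInt8::MAX (255):
#   offset + 10_u8  # => raises OverflowError

# Wrapping addition silently wraps modulo 256, so an each_with_index built
# on &+ would hand the block a wrapped index instead of an error:
puts offset &+ 10_u8 # => 4
```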

@kostya (Contributor, Author) commented Jan 13, 2024

Instead of AlwaysInline, we could, for example, somehow compute a method's complexity and inline it if it is small, say fewer than 5 instructions (even with all calls unfolded), has no loops, and is not self-recursive. But I don't see how to do that in the current compiler, because it generates LLVM directly without an intermediate form.

@kostya (Contributor, Author) commented Jan 13, 2024

Another idea: mark a whole class with AlwaysInline, like Pointer, and automatically inline all calls to it.

@kostya (Contributor, Author) commented Jan 13, 2024

There is already some small inlining in call codegen (https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/codegen/call.cr#L449), but it only inlines primitives, self, and getters. It should also inline setters, for example.
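For illustration, here is the kind of trivial accessor pair involved (Counter is a hypothetical class; the point is that the generated getter is a bare ivar load and the generated setter a bare ivar store, both one-instruction bodies):

```crystal
class Counter
  # `getter` expands to `def value; @value; end`: the codegen linked above
  # already inlines calls like this.
  getter value : Int32 = 0

  # `setter` expands to `def value=(@value); end`: a single instance-variable
  # store, so it is an equally good inlining candidate.
  setter value
end

c = Counter.new
c.value = 41
puts c.value # => 41
```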

@kostya (Contributor, Author) commented Jan 16, 2024

I'm experiencing very slow compilation of std_spec on this branch with -O2. I think this is because all the specs get compiled into a single module (since there are no classes), so parallel compilation provides no benefit here.

@ysbaddaden (Contributor):

@kostya indeed, the _main.bc file for make std_spec is 29MB in crystal's cache.

@ysbaddaden (Contributor):

  • praise: some very interesting investigation here! Finally some numbers to back some concepts: likely/unlikely + inlining directly instead of merely hinting LLVM 💯

  • praise: auto-inlining low-complexity methods sounds like a great idea (still need to check for @[NoInline]). Please open issues so we can talk about these (inline in crystal + auto-inline)!

  • suggestion: it would be even more interesting to know which types of optimization have the most impact (and how much) 👀

  • suggestion: if adding @[AlwaysInline] to all methods in Pointer(T) and Intrinsics has such a positive impact, maybe you can open a pull request with just that? They should be called often enough that LLVM will always inline them.

  • issue: as outlined by @funny-falcon, the removed overflow checks in Enumerable can lead to overlooked overflows (e.g. Enumerable doesn't have to be finite); we can, however, consider overriding the methods in objects with a finite number of iterations 👍

@kostya (Contributor, Author) commented Jan 24, 2024

To simplify testing, I added this repo: https://github.com/kostya/crystal-metric

--release: 25.6282s

Master:
-O2: 106.0206s

This branch:
-O2: 30.3626s

Master + only inline pointer:
-O2: 65.6595s

In this branch the -O2 runtime is close to --release, but compilation is much faster, and it is also parallel and incremental.

@kostya (Contributor, Author) commented Jan 25, 2024

My first branch was separated into commits. These are the crystal-metric results for each commit, at -O2:

  1. Initial, kostya@8b5fbba, 105.7559s
  2. Codegen always inline: kostya@675669e, 104.1533s
  3. kostya@43514d7, 70.3383s
  4. kostya@5d90685, 70.0416s
  5. kostya@9e8d792, 59.5174s
  6. kostya@1d212fb, 43.9694s
  7. kostya@15933d3, 31.0405s
  8. Likely, kostya@f3f3c40, 31.0076s
  9. kostya@66afa46, 30.6503s
  10. kostya@059ec72, 29.7027s
  11. kostya@3222e28, 30.3150s
  12. kostya@b5f5902, 30.2319s
  13. kostya@9044866, 30.3091s
  14. kostya@2bd9340, 29.7686s
  15. kostya@3ad621f, 30.2483s
  16. kostya@93097f7, 30.0238s
  17. Auto inline setters, kostya@4291b7b, 29.7598s

@ysbaddaden (Contributor):

Very nice! So, these 4 commits had the most impact:

kostya@43514d7, 70.3383s
kostya@9e8d792, 59.5174s
kostya@1d212fb, 43.9694s
kostya@15933d3, 31.0405s

Looking at them, it's mostly inlining methods, some (un)likely, and a few removed overflow checks in Enumerable. I believe what @funny-falcon and I suggested above still stands?

@kostya (Contributor, Author) commented Jan 26, 2024

I see one more thing to improve, which I don't know how to do: record constructors.

record Bla, x : Int32
Bla.new 1

Bla.new generates a call and slows down many benchmarks, but if I just replace the record with a Tuple, the benchmarks become much faster.
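For context, record is a macro that expands to roughly the following struct (a simplified sketch; the real expansion also adds conveniences like copy_with and to_s). Since a struct is a value type, Bla.new should in principle be just a stack write, yet under per-module -O2 the constructor is emitted as a real call:

```crystal
# `record Bla, x : Int32` expands to approximately:
struct Bla
  getter x : Int32

  def initialize(@x : Int32)
  end
end

puts Bla.new(1).x # => 1
```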

@kostya (Contributor, Author) commented Jan 29, 2024

Plotting these changes for the crystal-metric repo. 1.11.2-my is 1.11.2 recompiled on my MacBook M1 in release mode; the other releases were downloaded from GitHub. Funny that it has a faster compile time, maybe because of Rosetta, or maybe because of #12060.

[plot: crystal-metric results across releases]

7 participants