Loop Invariant Code Motion #1658
cc @MaxGraey

Fuzzing didn't find anything new, so I think this is pretty stable now. I did see that it can increase code size on fuzz testcases (and, more rarely, on real inputs) due to stuff like this: For a const or a
```cpp
if (auto* block = curr->dynCast<Block>()) {
  auto& list = block->list;
  Index i = list.size();
  if (i > 0) {
```
Why not `while (i > 0) {`?

Or even `for (Index i = list.size(); i > 0; i--) {`?
Good point, not sure why I had it this way.

(I prefer a `while`, as the other option means writing `i - 1` in the loop body.)
```cpp
}
if (!canMove) {
  // We failed to move the code, undo those changes.
  for (auto* set : currSets.list) {
```
Hm, rather than undoing the changes, how about having a second list, `updatedNumSetsForIndex`, counting up and seeing if the number of sets is identical? That makes this else clause unnecessary.
Might be slightly clearer, I agree, but on the other hand it means initializing an extra array of the size of `numSetsForIndex`, which is of size `numLocals * 4` bytes (could be a few K, and this is for each item we try to move out).

If this pass were larger or more complicated I'd lean more towards clarity over speed, but as it is I slightly lean towards speed.
```cpp
    }
  }
}
if (canMove) {
```
All these `if (canMove)`s are really awkward control flow. This loop is also getting really deeply indented. Can probably fix both of these by extracting some functions here, which can turn this loop into something like:

```cpp
if (interestingToMove(curr)) {
  if (!effects.hasGlobalSideEffects() &&
      !effectsSoFar.invalidates(effects) &&
      !(effects.noticesGlobalSideEffects() && loopEffects.hasGlobalSideEffects())) {
    bool canMove = getsAreOk(curr) && setsAreOk(curr, numSetsForIndex);
    if (canMove) {
      // We can move it! ...
    }
  }
}
```

This also lets us say things like `return false` instead of `canMove = false; break` in those functions, which should simplify things more.
Good point, fixing.
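The `return false` point in miniature, with a hypothetical scan extracted into a predicate so the flag-and-break pattern disappears (the name and input type are made up for illustration):

```cpp
#include <cassert>
#include <vector>

// Extracting a scanning loop into its own function lets failures
// `return false` immediately, instead of setting a flag and breaking
// out of a deeply nested loop.
bool allGetsOk(const std::vector<bool>& getOk) {
  for (bool ok : getOk) {
    if (!ok) {
      return false; // was: canMove = false; break;
    }
  }
  return true;
}
```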
```cpp
// The rest of the loop's effects matter too, we must also
// take into account global state like interacting loads and
// stores.
if (!effects.hasGlobalSideEffects() &&
```
I'd write this:

```cpp
bool unsafeToMove = effects.hasGlobalSideEffects() ||
                    effectsSoFar.invalidates(effects) ||
                    (effects.noticesGlobalSideEffects() &&
                     loopEffects.hasGlobalSideEffects());
if (!unsafeToMove) {
```

just to factor out the common `!`.
Good idea, fixing.
@kripken Could you include this LICM pass under the -O4 opt flag (which I hope applies it at an early stage)? I tried using LICM as an additional pass after everything else, in combination with "flatten", but it significantly increases binary size.
@MaxGraey I think we need to do some investigation before deciding where - did you measure it in both locations? My guess is that we need to run it early, as you said, and in particular that we need to run
Doing some quick tests, licm doesn't always help code size (even when done early). But I was looking at fuzz testcases, which is not great - we should look at real-world code.
I created an example: https://webassembly.studio/?f=p1ec32awtqi Currently it uses "-O3" and "--runPasses", "precompute-propagate, licm". "-O3" in AssemblyScript is actually forced to "-O4" for binaryen internally. If I add "flatten" to the passes, like "flatten, precompute-propagate, licm", I get a much bigger and less optimal result:
@MaxGraey I'm not sure what to look for in that first link? How can I tell which passes it runs when I click "Build"? Note that if you run flatten, you should run all normal opts later. Meanwhile, in my testing, I think it's expected that licm may increase code size. It can move something out of a loop that then requires saving it to a local and loading it, in particular - so it's faster but more verbose. I think maybe the right place is to do it early, but not when optimizing for size (-Os, --shrink-level, etc.).
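The size tradeoff described here can be seen in a small C++ analogue (hypothetical functions; the real transform operates on wasm locals): hoisting the invariant expression trades repeated computation for an extra local plus a set and later gets, so the code gets faster but more verbose.

```cpp
#include <cassert>

// Before licm: the invariant `x * 3` is recomputed every iteration,
// but appears only once in the code.
int loopBefore(int x, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += x * 3;
  }
  return sum;
}

// After licm: computed once, but now there is an extra local plus a
// set and a get - in wasm terms, more instructions overall.
int loopAfter(int x, int n) {
  int hoisted = x * 3; // the extra local mentioned above
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += hoisted;
  }
  return sum;
}
```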
At the first link AS used -O4 and the "precompute-propagate + licm" additional passes, but without "flatten". The result is the same as -O4 without licm, and it produces this code for the loop:

```wasm
...
(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
(local $l0 i32) (local $l1 i32)
block $B0
loop $L1
get_local $l0
get_local $p0
i32.ge_s
br_if $B0
get_local $l1
f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
call $f0
f64.mul
f64.add
i32.trunc_s/f64
i32.add
set_local $l1
get_local $l0
i32.const 1
i32.add
set_local $l0
br $L1
unreachable
end
unreachable
end
get_local $l1)
```

The second link uses -O4 + flatten + precompute-propagate + licm and produces this:

```wasm
(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
(local $l0 i32) (local $l1 i32) (local $l2 i32) (local $l3 i32) (local $l4 i32) (local $l5 i32) (local $l6 i32) (local $l7 i32) (local $l8 i32) (local $l9 i32) (local $l10 i32) (local $l11 i32) (local $l12 i32) (local $l13 f64) (local $l14 f64) (local $l15 f64)
block $B0
block $B1
loop $L2
block $B3
get_local $l0
set_local $l2
get_local $p0
set_local $l3
get_local $l2
get_local $l3
i32.ge_s
set_local $l4
get_local $l4
br_if $B1
nop
get_local $l1
set_local $l5
f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
call $f0
set_local $l13
f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
get_local $l13
f64.mul
set_local $l14
f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
get_local $l14
f64.add
set_local $l15
get_local $l15
i32.trunc_s/f64
set_local $l6
get_local $l5
get_local $l6
i32.add
set_local $l7
get_local $l7
set_local $l1
nop
get_local $l0
set_local $l8
get_local $l8
i32.const 1
i32.add
set_local $l9
get_local $l9
set_local $l0
nop
br $L2
unreachable
unreachable
end
unreachable
unreachable
unreachable
end
unreachable
unreachable
end
nop
get_local $l1
set_local $l10
get_local $l10
set_local $l11
end
get_local $l11
set_local $l12
get_local $l12
return)
```

It seems neither version does the LICM pass. Or am I missing something?
The second output looks like it's not fully optimized - the

Otherwise, in both loops licm can't help much due to the

But thinking more about the general issue, I think we need good benchmarks here - VMs will do licm themselves (except for baseline compilers or interpreters, perhaps), so since licm may increase code size, I am thinking it's generally better not to do it in Binaryen. But if we can find a benchmark showing otherwise, that would be useful.
About those nops, when you wrote

Did you mean
This adds an licm pass. Not that important for LLVM-originating code obviously, but for AssemblyScript and other non-LLVM compilers this might help a lot. Also, when wasm has GC, a bunch more non-LLVM languages may arrive that can benefit.

The pass is mostly straightforward. I considered using the DataFlow IR since it's in SSA form, or the CFG IR, but in the end it's actually pretty convenient to use the main IR as it is - with explicit loops already present - plus LocalGraph, which connects each get to the sets influencing it.

Still testing, but this passes a bunch of fuzzing, and also the emscripten test suite at -O1 with licm added to the default passes (but I don't think it would make sense to run this by default, as LLVM doesn't need it).
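As a rough illustration of the idea (not Binaryen's implementation, which works on the wasm IR with EffectAnalyzer and LocalGraph), here is a toy licm over a flat list of assignments: a statement is hoistable when its target is assigned only once in the loop and none of its operands are assigned inside the loop, and iterating to a fixed point lets chains of invariant statements move together. All names and types here are made up for the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy statement: `target = f(operands...)`. This ignores side effects
// and control flow entirely - the things the real pass must check via
// effect analysis - and only models the def/use part of the problem.
struct Stmt {
  std::string target;
  std::vector<std::string> operands;
};

// Move loop-invariant statements out of `body`, returning them in
// hoisting order. A statement is invariant if its target is assigned
// exactly once in the loop and no operand is (still) assigned inside.
std::vector<Stmt> hoistInvariants(std::vector<Stmt>& body) {
  std::vector<Stmt> hoisted;
  bool changed = true;
  while (changed) {
    changed = false;
    for (size_t i = 0; i < body.size(); i++) {
      size_t targetDefs = 0;
      bool operandsInvariant = true;
      for (auto& other : body) {
        if (other.target == body[i].target) {
          targetDefs++;
        }
        for (auto& op : body[i].operands) {
          if (other.target == op) {
            operandsInvariant = false;
          }
        }
      }
      if (targetDefs == 1 && operandsInvariant) {
        hoisted.push_back(body[i]);
        body.erase(body.begin() + i);
        changed = true;
        break; // rescan: hoisting may make more statements invariant
      }
    }
  }
  return hoisted;
}
```

For example, in a loop body `t = c * c; acc = acc + t; i = i + 1`, only `t` is hoisted: `acc` and `i` both read a variable assigned inside the loop.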