Conversation

@kripken commented Aug 30, 2018

This adds a licm (loop-invariant code motion) pass. Not that important for LLVM-originating code, obviously, but for AssemblyScript and other non-LLVM compilers it might help a lot. Also, once wasm has GC, a bunch more non-LLVM languages may arrive that can benefit.

The pass is mostly straightforward. I considered using the DataFlow IR since it's in SSA form, or the CFG IR, but in the end it's actually pretty convenient to use the main IR as it is - with explicit loops already present - plus LocalGraph which connects each get to the sets influencing it.
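
Roughly, the shape of the pass is something like this (a conceptual sketch with hypothetical helper names, not the actual code; the real checks are built on effect analysis plus LocalGraph):

// Conceptual sketch only: topLevelItems(), canHoist() and moveBeforeLoop()
// are hypothetical names, used here just to show the overall structure.
void visitLoop(Loop* loop) {
  for (auto* curr : topLevelItems(loop->body)) {
    // A candidate can be hoisted if its effects don't conflict with the
    // rest of the loop, and every get_local it reads is only influenced
    // by sets outside the loop (which is what LocalGraph tells us).
    if (canHoist(curr, loop)) {
      moveBeforeLoop(curr, loop);
    }
  }
}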

Still testing, but this passes a bunch of fuzzing, and also the emscripten test suite at -O1 with licm added to the default passes (but I don't think it would make sense to run this by default, as LLVM doesn't need it).

kripken commented Aug 30, 2018

cc @MaxGraey

kripken commented Aug 31, 2018

Fuzzing didn't find anything new, so I think this is pretty stable now.

I did see that it can increase code size on fuzz testcases (and, more rarely, on real inputs) due to stuff like this:

(loop
  (set_local $x (i32.const 1))
  ..
)
=>
(set_local $x (i32.const 1))
(loop
  ..
)

For a const or a get_local, such an assignment to a local is both very cheap (a copy to another local may be optimized out later), and moving it out may prevent other optimizations (since we have no pass that tries to move code back into a loop; edit: well, not by default, precompute-propagate etc. would do it, but those are only run at high opt levels). So I made the pass not move such trivial code (sets/tees of consts or gets).
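
Concretely, the triviality check is roughly this (a sketch, not the exact code in the pass):

// Sketch: a set/tee whose value is just a const or a get of another local
// is as cheap inside the loop as outside, so don't bother hoisting it.
static bool isTrivial(Expression* curr) {
  if (auto* set = curr->dynCast<SetLocal>()) {
    return set->value->is<Const>() || set->value->is<GetLocal>();
  }
  return false;
}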

if (auto* block = curr->dynCast<Block>()) {
  auto& list = block->list;
  Index i = list.size();
  if (i > 0) {

Contributor:

Why not while (i > 0) {?
Or even for (Index i = list.size(); i > 0; i--) {


Member Author:

Good point, not sure why I had it this way.

(I prefer a while as the other option means writing i - 1 in the loop body.)
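
Something like this, that is, with the decrement done once at the top of the body (a trivial sketch):

Index i = list.size();
while (i > 0) {
  i--;
  auto* curr = list[i]; // iterate the block's items from the end
  // ... handle curr ...
}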

}
if (!canMove) {
  // We failed to move the code, undo those changes.
  for (auto* set : currSets.list) {

Contributor:

Hm, rather than undoing the changes, how about having a second list, updatedNumSetsForIndex, counting up and seeing if the number of sets is identical. That makes this else clause unnecessary.


Member Author:

Might be slightly clearer, I agree, but on the other hand it means initializing an extra array of the size of numSetsForIndex, which is of size numLocals * 4 bytes (could be a few K, and this is for each item we try to move out).

If this pass were larger or more complicated I'd lean more towards clarity over speed, but as it is I slightly lean towards speed.
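
For comparison, the counting alternative would look roughly like this (a hypothetical sketch of the suggestion above, just to make the cost concrete):

// Hypothetical: a fresh counter array per candidate, sized numLocals,
// which is the allocation the previous paragraph is weighing against
// the small undo loop.
std::vector<Index> updatedNumSetsForIndex(numLocals, 0);
for (auto* set : currSets.list) {
  updatedNumSetsForIndex[set->index]++;
}
// ...then compare against numSetsForIndex to see if anything changed.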

}
}
}
if (canMove) {

Contributor:

All these if (canMove)s are really awkward control flow. This loop is also getting really deeply indented. Can probably fix both of these by extracting some functions here, which can turn this loop into something like:

if (interestingToMove(curr)) {
  if (!effects.hasGlobalSideEffects() &&
      !effectsSoFar.invalidates(effects) &&
      !(effects.noticesGlobalSideEffects() && loopEffects.hasGlobalSideEffects())) {
    bool canMove = getsAreOk(curr) && setsAreOk(curr, numSetsForIndex);
    if (canMove) {
      // We can move it! ...
    }
  }
}

This also lets us say things like return false instead of canMove = false; break in those functions, which should simplify things more.


Member Author:

Good point, fixing.

// The rest of the loop's effects matter too, we must also
// take into account global state like interacting loads and
// stores.
if (!effects.hasGlobalSideEffects() &&

Contributor:

I'd write this

bool unsafeToMove = effects.hasGlobalSideEffects() ||
    effectsSoFar.invalidates(effects) ||
    (effects.noticesGlobalSideEffects() && loopEffects.hasGlobalSideEffects());
if (!unsafeToMove) {

just to factor out the common !


Member Author:

Good idea, fixing.

@kripken merged commit ff5d6f9 into master on Sep 4, 2018
@kripken deleted the licm branch on September 4, 2018 at 23:37
MaxGraey commented Sep 14, 2018

@kripken Could you include this LICM pass under the -O4 opt flag (which I hope applies it at an early stage)? I tried using LICM as an additional pass after everything else, in combination with "flatten", but it significantly increases binary size.

kripken commented Sep 14, 2018

@MaxGraey I think we need to do some investigation before deciding where - did you measure it in both locations?

My guess is that we need to run it early, as you said, and in particular that we need to run precompute-propagate (to propagate constants) as well as the standard optimizations after it. But I didn't measure it carefully yet.

kripken commented Sep 14, 2018

Doing some quick tests, licm doesn't always help code size (even when done early). But I was looking at fuzz testcases, which is not great; we should look at real-world code.

MaxGraey commented Sep 14, 2018

I created an example: https://webassembly.studio/?f=p1ec32awtqi

Currently it uses "-O3" and "--runPasses": "precompute-propagate, licm". "-O3" in AssemblyScript is actually forced to "-O4" for binaryen internally.

If I add "flatten" to the passes, like "flatten, precompute-propagate, licm", I get a much bigger and less optimal result:
https://webassembly.studio/?f=pwdlgy78krs

kripken commented Sep 15, 2018

@MaxGraey I'm not sure what to look for in that first link. How can I tell which passes it runs when I click "Build"?

Note that if you run flatten, you should run all normal opts later. --flatten -O3 for example.

Meanwhile in my testing, I think it's expected that licm may increase code size. It can move something out of a loop that then requires saving it to a local and loading it, in particular - so it's faster but more verbose. I think maybe the right place is to do it early, but not when optimizing for size (-Os, --shrink-level, etc.)
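
For example, something like this (a made-up illustration of the general point, in the style of the earlier example; it is not actual output of the pass, and $t stands for the extra local the hoisted value now needs):

(loop
  (br_if $out (i32.gt_s (get_local $i) (i32.mul (get_local $a) (get_local $b))))
  ..
)
=>
(set_local $t (i32.mul (get_local $a) (get_local $b)))
(loop
  (br_if $out (i32.gt_s (get_local $i) (get_local $t)))
  ..
)

The loop-invariant product is computed once, but the output now carries an extra set_local/get_local pair, so it is faster yet slightly larger.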

MaxGraey commented Sep 15, 2018

At the first link, AS used -O4 and the "precompute-propagate + licm" additional passes, but without "flatten". The result is the same as -O4 without licm, and produces this code for the loop:

...
(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
    (local $l0 i32) (local $l1 i32)
    block $B0
      loop $L1
        get_local $l0
        get_local $p0
        i32.ge_s
        br_if $B0
        get_local $l1
        f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
        f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
        f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
        call $f0
        f64.mul
        f64.add
        i32.trunc_s/f64
        i32.add
        set_local $l1
        get_local $l0
        i32.const 1
        i32.add
        set_local $l0
        br $L1
        unreachable
      end
      unreachable
    end
    get_local $l1)

The second link uses -O4 + flatten + precompute-propagate + licm and produces this:

(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
    (local $l0 i32) (local $l1 i32) (local $l2 i32) (local $l3 i32) (local $l4 i32) (local $l5 i32) (local $l6 i32) (local $l7 i32) (local $l8 i32) (local $l9 i32) (local $l10 i32) (local $l11 i32) (local $l12 i32) (local $l13 f64) (local $l14 f64) (local $l15 f64)
    block $B0
      block $B1
        loop $L2
          block $B3
            get_local $l0
            set_local $l2
            get_local $p0
            set_local $l3
            get_local $l2
            get_local $l3
            i32.ge_s
            set_local $l4
            get_local $l4
            br_if $B1
            nop
            get_local $l1
            set_local $l5
            f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
            call $f0
            set_local $l13
            f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
            get_local $l13
            f64.mul
            set_local $l14
            f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
            get_local $l14
            f64.add
            set_local $l15
            get_local $l15
            i32.trunc_s/f64
            set_local $l6
            get_local $l5
            get_local $l6
            i32.add
            set_local $l7
            get_local $l7
            set_local $l1
            nop
            get_local $l0
            set_local $l8
            get_local $l8
            i32.const 1
            i32.add
            set_local $l9
            get_local $l9
            set_local $l0
            nop
            br $L2
            unreachable
            unreachable
          end
          unreachable
          unreachable
          unreachable
        end
        unreachable
        unreachable
      end
      nop
      get_local $l1
      set_local $l10
      get_local $l10
      set_local $l11
    end
    get_local $l11
    set_local $l12
    get_local $l12
    return)

It seems neither version actually does the LICM pass. Or am I missing something?

kripken commented Sep 17, 2018

The second output looks like it's not fully optimized - the nops should definitely not be there. Maybe something is going wrong in those links and it's not running the proper opts? Would be good to get a testcase that reproduces in the shell, where it's easy to follow which passes are run.

Otherwise, in both loops licm can't help much due to the call. Perhaps we should tune the inliner to handle that, but I'm not sure - that would need to be done carefully.

But thinking more about the general issue, I think we need good benchmarks here - VMs will do licm themselves (except for baseline compilers or interpreters perhaps), so since licm may increase code size, I am thinking it's generally better not to do it in Binaryen. But if we can find a benchmark showing otherwise, that would be useful.

kripken commented Sep 17, 2018

About those nops, when you wrote

-O4 + flatten + precompute-propagate + licm

Did you mean that -O4 is run first? It should be run after the other three.
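
(For reference, with standalone wasm-opt, which runs passes in the order they are given on the command line, that ordering would look roughly like the following; the exact flags are an illustration, not taken from the links above:)

wasm-opt input.wasm --flatten --precompute-propagate --licm -O4 -o output.wasm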

@MaxGraey commented:

Did you mean that -O4 is run first? It should be run after the other three.

It seems AssemblyScript applies general optimizations first and adds the optional extra passes after that, but better to ask @dcodeIO about that.
