Conversation

@kripken commented Aug 30, 2018

This adds a licm (loop-invariant code motion) pass. Not that important for LLVM-originating code, obviously, but for AssemblyScript and other non-LLVM compilers it might help a lot. Also, once wasm has GC, a bunch more non-LLVM languages may arrive that can benefit.

The pass is mostly straightforward. I considered using the DataFlow IR since it's in SSA form, or the CFG IR, but in the end it's actually pretty convenient to use the main IR as it is - with explicit loops already present - plus LocalGraph which connects each get to the sets influencing it.
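
Roughly, the shape of the pass is something like this (a conceptual sketch with hypothetical helper names, not the actual code; the real checks are built on effect analysis plus LocalGraph):

// Conceptual sketch only: topLevelItems(), canHoist() and moveBeforeLoop()
// are hypothetical names, used here just to show the overall structure.
void visitLoop(Loop* loop) {
  for (auto* curr : topLevelItems(loop->body)) {
    // A candidate can be hoisted if its effects don't conflict with the
    // rest of the loop, and every get_local it reads is only influenced
    // by sets outside the loop (which is what LocalGraph tells us).
    if (canHoist(curr, loop)) {
      moveBeforeLoop(curr, loop);
    }
  }
}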

Still testing, but this passes a bunch of fuzzing, and also the emscripten test suite at -O1 with licm added to the default passes (but I don't think it would make sense to run this by default, as LLVM doesn't need it).

kripken commented Aug 30, 2018

cc @MaxGraey

kripken commented Aug 31, 2018

Fuzzing didn't find anything new, so I think this is pretty stable now.

I did see that it can increase code size on fuzz testcases (and, more rarely, on real inputs) due to stuff like this:

(loop
  (set_local $x (i32.const 1))
  ..
)
=>
(set_local $x (i32.const 1))
(loop
  ..
)

For a const or a get_local, such an assignment to a local is both very cheap (a copy to another local may be optimized out later), and moving it out may prevent other optimizations (since we have no pass that tries to move code back into a loop; edit: well, not by default, precompute-propagate etc. would do it, but those are only run at high opt levels). So I made the pass not move such trivial code (sets/tees of consts or gets).
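
Concretely, the triviality check is roughly this (a sketch, not the exact code in the pass):

// Sketch: a set/tee whose value is just a const or a get of another local
// is as cheap inside the loop as outside, so don't bother hoisting it.
static bool isTrivial(Expression* curr) {
  if (auto* set = curr->dynCast<SetLocal>()) {
    return set->value->is<Const>() || set->value->is<GetLocal>();
  }
  return false;
}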

if (auto* block = curr->dynCast<Block>()) {
  auto& list = block->list;
  Index i = list.size();
  if (i > 0) {

Contributor:

Why not while (i > 0) {?
Or even for (Index i = list.size(); i > 0; i--) {


Member Author:

Good point, not sure why I had it this way.

(I prefer a while as the other option means writing i - 1 in the loop body.)
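
Something like this, that is, with the decrement done once at the top of the body (a trivial sketch):

Index i = list.size();
while (i > 0) {
  i--;
  auto* curr = list[i]; // iterate the block's items from the end
  // ... handle curr ...
}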

}
if (!canMove) {
  // We failed to move the code, undo those changes.
  for (auto* set : currSets.list) {

Contributor:

Hm, rather than undoing the changes, how about having a second list, updatedNumSetsForIndex, counting up and seeing if the number of sets is identical. That makes this else clause unnecessary.


Member Author:

Might be slightly clearer, I agree, but on the other hand it means initializing an extra array of the size of numSetsForIndex, which is of size numLocals * 4 bytes (could be a few K, and this is for each item we try to move out).

If this pass were larger or more complicated I'd lean more towards clarity over speed, but as it is I slightly lean towards speed.
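
For comparison, the counting alternative would look roughly like this (a hypothetical sketch of the suggestion above, just to make the cost concrete):

// Hypothetical: a fresh counter array per candidate, sized numLocals,
// which is the allocation the previous paragraph is weighing against
// the small undo loop.
std::vector<Index> updatedNumSetsForIndex(numLocals, 0);
for (auto* set : currSets.list) {
  updatedNumSetsForIndex[set->index]++;
}
// ...then compare against numSetsForIndex to see if anything changed.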

}
}
}
if (canMove) {

Contributor:

All these if (canMove)s are really awkward control flow. This loop is also getting really deeply indented. Can probably fix both of these by extracting some functions here, which can turn this loop into something like:

if (interestingToMove(curr)) {
  if (!effects.hasGlobalSideEffects() &&
      !effectsSoFar.invalidates(effects) &&
      !(effects.noticesGlobalSideEffects() && loopEffects.hasGlobalSideEffects())) {
    bool canMove = getsAreOk(curr) && setsAreOk(curr, numSetsForIndex);
    if (canMove) {
      // We can move it! ...
    }
  }
}

This also lets us say things like return false instead of canMove = false; break in those functions, which should simplify things more.


Member Author:

Good point, fixing.

// The rest of the loop's effects matter too, we must also
// take into account global state like interacting loads and
// stores.
if (!effects.hasGlobalSideEffects() &&

Contributor:

I'd write this

bool unsafeToMove = effects.hasGlobalSideEffects() ||
    effectsSoFar.invalidates(effects) ||
    (effects.noticesGlobalSideEffects() && loopEffects.hasGlobalSideEffects());
if (!unsafeToMove) {

just to factor out the common !


Member Author:

Good idea, fixing.

@kripken merged commit ff5d6f9 into master on Sep 4, 2018
@kripken deleted the licm branch on September 4, 2018 at 23:37
MaxGraey commented Sep 14, 2018

@kripken Could you include this LICM pass under the -O4 opt flag (which I hope applies it at an early stage)? I tried using LICM as an additional pass after everything else, in combination with "flatten", but it significantly increases binary size.

kripken commented Sep 14, 2018

@MaxGraey I think we need to do some investigation before deciding where - did you measure it in both locations?

My guess is that we need to run it early, as you said, and in particular that we need to run precompute-propagate (to propagate constants) as well as the standard optimizations after it. But I didn't measure it carefully yet.

kripken commented Sep 14, 2018

Doing some quick tests, licm doesn't always help code size (even when done early). But I was looking at fuzz testcases, which is not great; we should look at real-world code.

MaxGraey commented Sep 14, 2018

I created an example: https://webassembly.studio/?f=p1ec32awtqi

Currently it uses "-O3" and "--runPasses": "precompute-propagate, licm". "-O3" in AssemblyScript is actually forced to "-O4" for binaryen internally.

If I add "flatten" to the passes, like "flatten, precompute-propagate, licm", I get a much bigger and less optimal result:
https://webassembly.studio/?f=pwdlgy78krs

kripken commented Sep 15, 2018

@MaxGraey I'm not sure what to look for in that first link. How can I tell which passes it runs when I click "Build"?

Note that if you run flatten, you should run all normal opts later. --flatten -O3 for example.

Meanwhile in my testing, I think it's expected that licm may increase code size. It can move something out of a loop that then requires saving it to a local and loading it, in particular - so it's faster but more verbose. I think maybe the right place is to do it early, but not when optimizing for size (-Os, --shrink-level, etc.)
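
For example, something like this (a made-up illustration of the general point, in the style of the earlier example; it is not actual output of the pass, and $t stands for the extra local the hoisted value now needs):

(loop
  (br_if $out (i32.gt_s (get_local $i) (i32.mul (get_local $a) (get_local $b))))
  ..
)
=>
(set_local $t (i32.mul (get_local $a) (get_local $b)))
(loop
  (br_if $out (i32.gt_s (get_local $i) (get_local $t)))
  ..
)

The loop-invariant product is computed once, but the output now carries an extra set_local/get_local pair, so it is faster yet slightly larger.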

MaxGraey commented Sep 15, 2018

At the first link, AS used -O4 and the "precompute-propagate + licm" additional passes, but without "flatten". The result is the same as -O4 without licm, and produces this code for the loop:

...
(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
    (local $l0 i32) (local $l1 i32)
    block $B0
      loop $L1
        get_local $l0
        get_local $p0
        i32.ge_s
        br_if $B0
        get_local $l1
        f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
        f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
        f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
        call $f0
        f64.mul
        f64.add
        i32.trunc_s/f64
        i32.add
        set_local $l1
        get_local $l0
        i32.const 1
        i32.add
        set_local $l0
        br $L1
        unreachable
      end
      unreachable
    end
    get_local $l1)

The second link uses -O4 + flatten + precompute-propagate + licm and produces this:

(func $licmTest (export "licmTest") (type $t0) (param $p0 i32) (result i32)
    (local $l0 i32) (local $l1 i32) (local $l2 i32) (local $l3 i32) (local $l4 i32) (local $l5 i32) (local $l6 i32) (local $l7 i32) (local $l8 i32) (local $l9 i32) (local $l10 i32) (local $l11 i32) (local $l12 i32) (local $l13 f64) (local $l14 f64) (local $l15 f64)
    block $B0
      block $B1
        loop $L2
          block $B3
            get_local $l0
            set_local $l2
            get_local $p0
            set_local $l3
            get_local $l2
            get_local $l3
            i32.ge_s
            set_local $l4
            get_local $l4
            br_if $B1
            nop
            get_local $l1
            set_local $l5
            f64.const 0x1.8246d97276aabp-3 (;=0.188612;)
            call $f0
            set_local $l13
            f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
            get_local $l13
            f64.mul
            set_local $l14
            f64.const 0x1.bcb7b1526e50ep-2 (;=0.434294;)
            get_local $l14
            f64.add
            set_local $l15
            get_local $l15
            i32.trunc_s/f64
            set_local $l6
            get_local $l5
            get_local $l6
            i32.add
            set_local $l7
            get_local $l7
            set_local $l1
            nop
            get_local $l0
            set_local $l8
            get_local $l8
            i32.const 1
            i32.add
            set_local $l9
            get_local $l9
            set_local $l0
            nop
            br $L2
            unreachable
            unreachable
          end
          unreachable
          unreachable
          unreachable
        end
        unreachable
        unreachable
      end
      nop
      get_local $l1
      set_local $l10
      get_local $l10
      set_local $l11
    end
    get_local $l11
    set_local $l12
    get_local $l12
    return)

It seems neither version actually does the LICM pass. Or am I missing something?

kripken commented Sep 17, 2018

The second output looks like it's not fully optimized - the nops should definitely not be there. Maybe something is going wrong in those links and it's not running the proper opts? Would be good to get a testcase that reproduces in the shell, where it's easy to follow which passes are run.

Otherwise, in both loops licm can't help much due to the call. Perhaps we should tune the inliner to handle that, but I'm not sure - that would need to be done carefully.

But thinking more about the general issue, I think we need good benchmarks here - VMs will do licm themselves (except for baseline compilers or interpreters perhaps), so since licm may increase code size, I am thinking it's generally better not to do it in Binaryen. But if we can find a benchmark showing otherwise, that would be useful.

kripken commented Sep 17, 2018

About those nops, when you wrote

-O4 + flatten + precompute-propagate + licm

Did you mean that -O4 is run first? It should be run after the other three.
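
(For reference, with standalone wasm-opt, which runs passes in the order they are given on the command line, that ordering would look roughly like the following; the exact flags are an illustration, not taken from the links above:)

wasm-opt input.wasm --flatten --precompute-propagate --licm -O4 -o output.wasm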

@MaxGraey commented:

Did you mean that -O4 is run first? It should be run after the other three.

It seems AssemblyScript applies general optimizations first and adds the optional extra passes after that, but better to ask @dcodeIO about that.
