-
Notifications
You must be signed in to change notification settings - Fork 786
Optimize tuple.extract of gets in BinaryInstWriter #5941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
Codecov Report
@@ Coverage Diff @@
## main #5941 +/- ##
==========================================
- Coverage 42.80% 42.79% -0.01%
==========================================
Files 485 485
Lines 74943 74963 +20
Branches 11958 11965 +7
==========================================
+ Hits 32076 32084 +8
- Misses 39665 39670 +5
- Partials 3202 3209 +7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
) | ||
) | ||
(local.get $13) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great improvement!
src/wasm-stack.h
Outdated
@@ -148,6 +148,11 @@ class BinaryInstWriter : public OverriddenVisitor<BinaryInstWriter> { | |||
InsertOrderedMap<Type, Index> scratchLocals; | |||
void countScratchLocals(); | |||
void setScratchLocals(); | |||
|
|||
// local.get and local.tee expressions that will be followed by | |||
// tuple.extracts. We can optimize these by only getting only the local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// tuple.extracts. We can optimize these by only getting only the local | |
// tuple.extracts. We can optimize these by getting only the local |
src/wasm/wasm-stack.cpp
Outdated
// We only need to get the single extracted value. | ||
if (it->second == 0) { | ||
o << int8_t(BinaryConsts::LocalTee) | ||
<< U32LEB(mappedLocals[std::make_pair(curr->index, it->second)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<< U32LEB(mappedLocals[std::make_pair(curr->index, it->second)]); | |
<< U32LEB(mappedLocals[std::make_pair(curr->index, 0)]); |
o << int8_t(BinaryConsts::GlobalGet) << U32LEB(index + it->second); | ||
return; | ||
} | ||
// Emit a global.get for each element if this is a tuple global |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, we allow tuple globals? Does the wasm spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we weren't originally going to support tuple globals, but when we first implemented all this tuple business we found that some passes needed them. I don't remember which ones those were, though.
The Wasm spec supports tuple globals only as much as it supports tuple locals, i.e. not at all. But that's ok!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... a little odd, but ok I guess 😄 I suppose the passes would need to create more globals or such, otherwise.
src/wasm/wasm-stack.cpp
Outdated
@@ -2532,6 +2564,15 @@ void BinaryInstWriter::countScratchLocals() { | |||
for (auto& [type, _] : scratchLocals) { | |||
noteLocalType(type); | |||
} | |||
// While we have all the tuple.extracts, also find extracts of local.gets and | |||
// local.tees that we can optimize. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment could mention global gets too.
test/multivalue.wast
Outdated
(global.get $g2) | ||
(tuple.extract 1 | ||
(global.get $g2) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also have a test of a global.get
of a tuple that does not flow into an extract
?
952099a
to
386b912
Compare
Rebased on top of a separate PR moving the tests to lit, which also made sure we are testing both an extract and non-extract use of a tuple global.get. The other comments have been addressed in the latest commit. |
o << int8_t(BinaryConsts::GlobalGet) << U32LEB(index + it->second); | ||
return; | ||
} | ||
// Emit a global.get for each element if this is a tuple global |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... a little odd, but ok I guess 😄 I suppose the passes would need to create more globals or such, otherwise.
In general, the binary lowering of tuple.extract expects that all the tuple values are on top of the stack, so it inserts drops and possibly uses a scratch local to ensure only the extracted value is left. However, when the extracted tuple expression is a local.get, local.tee, or global.get, it's much more efficient to change the lowering of the get or tee to ensure that only the extracted value is on the stack to begin with. Implement that optimization in the binary writer.
386b912
to
20d855e
Compare
Fuzz bug: (module
(type $0 (func (result i64 f32 i32 i32 i32)))
(type $1 (func (result f32)))
(func $0 (type $0) (result i64 f32 i32 i32 i32)
(local $3 (i64 f32 i32 i32 i32))
(local $4 f32)
(local.set $3
(tuple.make
(i64.const 0)
(call $1)
(i32.const 0)
(i32.const 0)
(i32.const 0)
)
)
(local.set $4
(tuple.extract 1
(local.get $3)
)
)
(tuple.make
(i64.const 0)
(f32.const 0)
(i32.const 0)
(i32.const 0)
(i32.const 0)
)
)
(func $1 (type $1) (result f32)
(f32.convert_i32_s
(i32.const 0)
)
)
) STR:
|
Oof, I'm not sure how we should handle locals that have been optimized out in the stack IR, actually. Do you have an idea? |
I think it might be tricky to combine the two. Really the tuples should be lowered before Stack IR (is in Poppy IR), to make this clean. But for now, how about disabling local2stack on tuple locals in StackIR? The benefit of that optimization is around 1-2% at best, while your benefit here is quite large. |
In general, the binary lowering of tuple.extract expects that all the tuple values are on top of the stack, so it inserts drops and possibly uses a scratch local to ensure only the extracted value is left. However, when the extracted tuple expression is a local.get, local.tee, or global.get, it's much more efficient to change the lowering of the get or tee to ensure that only the extracted value is on the stack to begin with. Implement that optimization in the binary writer.
…ly#5941)" (WebAssembly#5945) This reverts commit 56ce1ea. The binary writer optimization is not always correct when stack IR optimizations have run. Revert the change until we can fix it.
In general, the binary lowering of tuple.extract expects that all the tuple
values are on top of the stack, so it inserts drops and possibly uses a scratch
local to ensure only the extracted value is left. However, when the extracted
tuple expression is a local.get, local.tee, or global.get, it's much more
efficient to change the lowering of the get or tee to ensure that only the
extracted value is on the stack to begin with. Implement that optimization in
the binary writer.