This issue is about porting wasm-opt itself to wasm, and how fast it runs there.
The main blocker here was that it was 10x slower than a native build:
emscripten-core/emscripten#15727
That turned out to be a multithreading issue, and it was fixed by using Emscripten's recent mimalloc port (see the last comments in that issue). Instead of 10x slower, the wasm build is now 2x slower, which is not bad considering it manages to saturate all CPU cores just like a native build.
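For context on why the allocator mattered so much: the stressful pattern is many worker threads doing lots of small allocations and frees in parallel, which is roughly what running function-level passes across a thread pool amounts to. Here is a minimal standalone sketch of that pattern (a generic illustration, not Binaryen's actual pass runner):

```cpp
// Minimal sketch (not Binaryen code): N threads each doing many small,
// short-lived allocations in parallel. A lock-heavy allocator tends to
// serialize the threads on this workload, while a thread-friendly one
// such as mimalloc tends to scale with core count, which is the kind of
// difference behind the earlier 10x slowdown.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

int main() {
  const unsigned numThreads = std::thread::hardware_concurrency();
  const int allocsPerThread = 1'000'000;

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; t++) {
    threads.emplace_back([&] {
      std::vector<std::unique_ptr<char[]>> live;
      live.reserve(1024);
      for (int i = 0; i < allocsPerThread; i++) {
        // Small allocations, like expression nodes and per-pass temporaries.
        live.emplace_back(new char[32]);
        if (live.size() == 1024) {
          live.clear(); // free a batch
        }
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  auto end = std::chrono::steady_clock::now();
  std::printf(
    "%u threads: %lld ms\n", numThreads,
    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count());
  return 0;
}
```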
Looking more into the remaining 2x: that factor remains even when pinning both builds to a single core, so multithreading is not the cause. Some quick profiling:
C++ symbols above 1% (linux perf):

```
Overhead  Command   Shared Object        Symbol
  6.07%   wasm-opt  libbinaryen.so       [.] wasm::LocalGraphInternal::Flower::flow(wasm::Function*)
  3.91%   wasm-opt  libc.so.6            [.] _int_malloc
  2.92%   wasm-opt  libc.so.6            [.] _int_free
  2.75%   wasm-opt  libc.so.6            [.] malloc
  1.98%   wasm-opt  libc.so.6            [.] malloc_consolidate
  1.61%   wasm-opt  libbinaryen.so       [.] std::__detail::_Map_base<wasm::LocalGet*, std::pair<wasm::LocalGet* const, wasm::SmallSet<wasm::LocalSet*, 2ul> >, std::allocator<std::pair<wasm
  1.47%   wasm-opt  libc.so.6            [.] cfree@GLIBC_2.2.5
  1.39%   wasm-opt  libbinaryen.so       [.] wasm::LocalGraph::computeSetInfluences()
  1.21%   wasm-opt  libbinaryen.so       [.] wasm::(anonymous namespace)::RedundantSetElimination::flowValues(wasm::Function*)
  1.17%   wasm-opt  libbinaryen.so       [.] wasm::PostWalker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::scan(wasm::BranchUtil
  1.14%   wasm-opt  libc.so.6            [.] unlink_chunk.constprop.0
  1.06%   wasm-opt  libstdc++.so.6.0.32  [.] std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
```
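The hot native symbols suggest most of the time goes into LocalGraph building its map from each `local.get` to the `local.set`s that can reach it, plus the hashing and small-set allocations that map implies. A rough sketch of that shape (a hypothetical stand-in, not Binaryen's actual LocalGraph) looks like this:

```cpp
// Hedged sketch (not Binaryen's actual code) of the shape of work the hot
// symbols above imply: map every local.get to the local.sets whose values
// can reach it, in a hash map whose values are small sets. That means one
// hash insertion per get plus an allocation whenever a set outgrows its
// inline storage, which is why malloc/free and the hash-table emplace show
// up alongside Flower::flow itself.
#include <unordered_map>
#include <vector>

struct LocalGet { unsigned index; };
struct LocalSet { unsigned index; };

// Stand-in for SmallSet<LocalSet*, 2ul>: the real small set keeps up to two
// entries inline and only heap-allocates beyond that.
using Sets = std::vector<LocalSet*>;
using GetSetsMap = std::unordered_map<LocalGet*, Sets>;

GetSetsMap computeGetSets(const std::vector<LocalGet*>& gets,
                          const std::vector<LocalSet*>& sets) {
  GetSetsMap getSets;
  getSets.reserve(gets.size());
  for (auto* get : gets) {
    auto& reaching = getSets[get]; // one hash insertion per get
    for (auto* set : sets) {
      // Placeholder reachability test; the real analysis flows reaching
      // definitions through the CFG rather than comparing indexes.
      if (set->index == get->index) {
        reaching.push_back(set); // may allocate
      }
    }
  }
  return getSets;
}
```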
Wasm symbols above 1% (node --prof):

```
[Summary]:
   ticks  total  nonlib   name
    9241  86.9%   90.9%   JavaScript
     575   5.4%    5.7%   C++
      11   0.1%    0.1%   GC
     476   4.5%           Shared libraries
     348   3.3%           Unaccounted
```
In Node.js there is some 5-10% overhead that is not in the compiled wasm code itself; as best I can tell it is V8 compiling the wasm, but it is minor compared to the 2x slowdown, so we can ignore it. The breakdown of the compiled wasm looks like this:
```
[JavaScript]:
   ticks  total  nonlib   name
     642   6.0%    6.3%   Function: *Walker<ReFinalize, OverriddenVisitor<ReFinalize, void>>::pushTask(void (*)(ReFinalize*, Expression**), Expression**)
     554   5.2%    5.5%   Function: *void SmallVector<Walker<ReFinalize, OverriddenVisitor<ReFinalize, void>>::Task, 10ul>::emplace_back<void (*&)(ReFinalize*, Expression**), Expression**&>(void (*&)(ReFinalize*, Expression**), Expression**&)
     437   4.1%    4.3%   Function: *LocalGraphInternal::Flower::flow(Function*)
     369   3.5%    3.6%   Function: *mi_malloc
     189   1.8%    1.9%   Function: *mi_free
     150   1.4%    1.5%   Function: *PostWalker<BranchUtils::BranchSeeker, UnifiedExpressionVisitor<BranchUtils::BranchSeeker, void>>::scan(BranchUtils::BranchSeeker*, Expression**)
     123   1.2%    1.2%   Function: *void SmallVector<Walker<TranslateToFuzzReader::recombine(Function*)::Scanner, UnifiedExpressionVisitor<TranslateToFuzzReader::recombine(Function*)::Scanner, void>>::Task, 10ul>::emplace_back<void (*&)(TranslateToFuzzReader::recombine(Function*)::Scanner*, Expression**), Expression**&>(void (*&)(TranslateToFuzzReader::recombine(Function*)::Scanner*, Expression**), Expression**&)
     113   1.1%    1.1%   Function: *std::__2::pair<std::__2::__hash_iterator<std::__2::__hash_node<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, void*>*>, bool> std::__2::__hash_table<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::__unordered_map_hasher<LocalGet*, std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::hash<LocalGet*>, std::__2::equal_to<LocalGet*>, true>, std::__2::__unordered_map_equal<LocalGet*, std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::equal_to<LocalGet*>, std::__2::hash<LocalGet*>, true>, std::__2::allocator<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>>>::__emplace_unique_key_args<LocalGet*, std::__2::piecewise_construct_t const&, std::__2::tuple<LocalGet* const&>, std::__2::tuple<>>(LocalGet* const&, std::__2::piecewise_construct_t const&, std::__2::tuple<LocalGet* const&>&&, std::__2::tuple<>&&)
     109   1.0%    1.1%   Function: *SimplifyLocals<true, true, true>::visitPost(SimplifyLocals<true, true, true>*, Expression**)
     101   0.9%    1.0%   Function: *CFGWalker<LocalGraphInternal::Flower, Visitor<LocalGraphInternal::Flower, void>, LocalGraphInternal::Info>::scan(LocalGraphInternal::Flower*, Expression**)
```
LocalGraph is high in both, as is malloc/free, which all makes sense. More suspicious is that ReFinalize takes over 10% in wasm while in native C++ it is hardly noticeable:
```
  0.81%  wasm-opt  wasm-opt  [.] wasm::PostWalker<wasm::ReFinalize, wasm::OverriddenVisitor<wasm::ReFinalize, void> >::scan(wasm::ReFinalize*, wasm::Expression**)
  0.58%  wasm-opt  wasm-opt  [.] wasm::Walker<wasm::ReFinalize, wasm::OverriddenVisitor<wasm::ReFinalize, void> >::pushTask(void (*)(wasm::ReFinalize*, wasm::Expression**), wasm
```
That might be related to inlining differences: if ReFinalize does lots of malloc/free, those calls show up in the native profile under malloc/free, but in the wasm build they may be inlined and so get attributed to the calling function.
ReFinalize does not do much work beyond a quick walk of the module, so this suggests that walks in general are slower in wasm...
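For reference, the hot wasm symbols above (pushTask and SmallVector::emplace_back) come from the task-stack style of walk, where the traversal pushes one task per child and one visit task per node. A minimal sketch of that pattern (not Binaryen's actual Walker) is below; the point is that the inner loop is just pushes and indirect calls, so any per-push overhead in the wasm build multiplies across every expression in the module:

```cpp
// Hedged sketch (not Binaryen's actual Walker) of a task-stack post-order
// walk: scanning a node pushes a "visit" task plus one "scan" task per
// child, and the main loop pops tasks and calls them through a function
// pointer. The hot path is pushTask / the vector's emplace.
#include <cstdio>
#include <vector>

struct Expression {
  std::vector<Expression*> children;
  int id = 0;
};

struct WalkerSketch {
  using TaskFunc = void (*)(WalkerSketch*, Expression*);

  struct Task {
    TaskFunc func;
    Expression* curr;
  };

  // The real code uses a SmallVector with inline storage; a plain vector
  // keeps the sketch short.
  std::vector<Task> stack;

  void pushTask(TaskFunc func, Expression* curr) {
    stack.push_back({func, curr}); // the hot emplace in the profile
  }

  static void doVisit(WalkerSketch* /*self*/, Expression* curr) {
    std::printf("visit %d\n", curr->id);
  }

  static void scan(WalkerSketch* self, Expression* curr) {
    // Post-order: the visit runs after the children, which are scanned first
    // (children are pushed last, so they pop first).
    self->pushTask(doVisit, curr);
    for (auto it = curr->children.rbegin(); it != curr->children.rend(); ++it) {
      self->pushTask(scan, *it);
    }
  }

  void walk(Expression* root) {
    pushTask(scan, root);
    while (!stack.empty()) {
      Task task = stack.back();
      stack.pop_back();
      task.func(this, task.curr);
    }
  }
};

int main() {
  Expression a{{}, 1}, b{{}, 2};
  Expression root{{&a, &b}, 0};
  WalkerSketch walker;
  walker.walk(&root); // prints 1, 2, then 0 (post-order)
}
```

If the walk machinery itself is what is slower in wasm, that would explain why a pass as light as ReFinalize shows up prominently only in the wasm profile.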