Skip to content

wasm-opt wasm port performance #6338

@kripken

Description

@kripken

This issue is about porting wasm-opt itself to wasm, and how fast it runs there.

The main blocker here was that it was 10x slower than a native build:

emscripten-core/emscripten#15727

That turned out to be an issue with multithreading, and is fixed by using Emscripten's recent mimalloc port (see last comments in that issue). Now instead of 10x slower it is 2x slower, which is not bad considering it is managing to saturate all CPU cores like a native build.

Looking more into the remaining 2x, that factor remains even when pinning both builds to a single core, so multithreading is not related. Some quick profiling:

C++ symbols above 1% (linux perf)

Overhead  Command   Shared Object         Symbol
   6.07%  wasm-opt  libbinaryen.so        [.] wasm::LocalGraphInternal::Flower::flow(wasm::Function*)
   3.91%  wasm-opt  libc.so.6             [.] _int_malloc
   2.92%  wasm-opt  libc.so.6             [.] _int_free
   2.75%  wasm-opt  libc.so.6             [.] malloc
   1.98%  wasm-opt  libc.so.6             [.] malloc_consolidate
   1.61%  wasm-opt  libbinaryen.so        [.] std::__detail::_Map_base<wasm::LocalGet*, std::pair<wasm::LocalGet* const, wasm::SmallSet<wasm::LocalSet*, 2ul> >, std::allocator<std::pair<wasm
   1.47%  wasm-opt  libc.so.6             [.] cfree@GLIBC_2.2.5
   1.39%  wasm-opt  libbinaryen.so        [.] wasm::LocalGraph::computeSetInfluences()
   1.21%  wasm-opt  libbinaryen.so        [.] wasm::(anonymous namespace)::RedundantSetElimination::flowValues(wasm::Function*)
   1.17%  wasm-opt  libbinaryen.so        [.] wasm::PostWalker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::scan(wasm::BranchUtil
   1.14%  wasm-opt  libc.so.6             [.] unlink_chunk.constprop.0
   1.06%  wasm-opt  libstdc++.so.6.0.32   [.] std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)

Wasm symbols above 1% (node --prof)

 [Summary]:
   ticks  total  nonlib   name
   9241   86.9%   90.9%  JavaScript
    575    5.4%    5.7%  C++
     11    0.1%    0.1%  GC
    476    4.5%          Shared libraries
    348    3.3%          Unaccounted

In node there is some 5-10% overhead that is not compiled wasm code, and which seems to be V8 compiling the wasm as best I can tell, but that is minor compared to the 2x slowdown so we can ignore it. The compiled wasm breakdown looks like this:

 [JavaScript]:
   ticks  total  nonlib   name
    642    6.0%    6.3%  Function: *Walker<ReFinalize, OverriddenVisitor<ReFinalize, void>>::pushTask(void (*)(ReFinalize*, Expression**), Expression**)
    554    5.2%    5.5%  Function: *void SmallVector<Walker<ReFinalize, OverriddenVisitor<ReFinalize, void>>::Task, 10ul>::emplace_back<void (*&)(ReFinalize*, Expression**), Expression**&>(void (*&)(ReFinalize*, Expression**), Expression**&)
    437    4.1%    4.3%  Function: *LocalGraphInternal::Flower::flow(Function*)
    369    3.5%    3.6%  Function: *mi_malloc
    189    1.8%    1.9%  Function: *mi_free
    150    1.4%    1.5%  Function: *PostWalker<BranchUtils::BranchSeeker, UnifiedExpressionVisitor<BranchUtils::BranchSeeker, void>>::scan(BranchUtils::BranchSeeker*, Expression**)
    123    1.2%    1.2%  Function: *void SmallVector<Walker<TranslateToFuzzReader::recombine(Function*)::Scanner, UnifiedExpressionVisitor<TranslateToFuzzReader::recombine(Function*)::Scanner, void>>::Task, 10ul>::emplace_back<void (*&)(TranslateToFuzzReader::recombine(Function*)::Scanner*, Expression**), Expression**&>(void (*&)(TranslateToFuzzReader::recombine(Function*)::Scanner*, Expression**), Expression**&)
    113    1.1%    1.1%  Function: *std::__2::pair<std::__2::__hash_iterator<std::__2::__hash_node<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, void*>*>, bool> std::__2::__hash_table<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::__unordered_map_hasher<LocalGet*, std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::hash<LocalGet*>, std::__2::equal_to<LocalGet*>, true>, std::__2::__unordered_map_equal<LocalGet*, std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>, std::__2::equal_to<LocalGet*>, std::__2::hash<LocalGet*>, true>, std::__2::allocator<std::__2::__hash_value_type<LocalGet*, SmallSet<LocalSet*, 2ul>>>>::__emplace_unique_key_args<LocalGet*, std::__2::piecewise_construct_t const&, std::__2::tuple<LocalGet* const&>, std::__2::tuple<>>(LocalGet* const&, std::__2::piecewise_construct_t const&, std::__2::tuple<LocalGet* const&>&&, std::__2::tuple<>&&)
    109    1.0%    1.1%  Function: *SimplifyLocals<true, true, true>::visitPost(SimplifyLocals<true, true, true>*, Expression**)
    101    0.9%    1.0%  Function: *CFGWalker<LocalGraphInternal::Flower, Visitor<LocalGraphInternal::Flower, void>, LocalGraphInternal::Info>::scan(LocalGraphInternal::Flower*, Expression**)

LocalGraph is high in both, as is malloc/free, which all makes sense. More suspicious is that ReFinalize takes over 10% in wasm while in native C++ it is hardly noticeable:

   0.81%  wasm-opt  wasm-opt              [.] wasm::PostWalker<wasm::ReFinalize, wasm::OverriddenVisitor<wasm::ReFinalize, void> >::scan(wasm::ReFinalize*, wasm::Expression**)
   0.58%  wasm-opt  wasm-opt              [.] wasm::Walker<wasm::ReFinalize, wasm::OverriddenVisitor<wasm::ReFinalize, void> >::pushTask(void (*)(wasm::ReFinalize*, wasm::Expression**), wasm

That might be related to inlining differences, if ReFinalize does lots of malloc/free which show up in C++ as malloc/free but are inlined in wasm and end up there as the calling function.

ReFinalize does not do that much work beyond a quick walk of the module, so this suggests all walks are slower in wasm...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions