-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very poor performance of trigonometric functions #19284
Comments
When you say |
It is a custom js code inside a EM_ASM_ block which is as much close as possible to the c++ one. |
Yes, although the difference here is that JavaScript has trig functions builtin (e.g. Math.atan2), but IIRC WebAssembly does not. So you your C/C++ code will call into atan2() code which is implemented in userspace in terms of lower level math function: https://github.com/emscripten-core/emscripten/blob/main/system/lib/libc/musl/src/math/atan.c |
@JeromeDesfieux Out of curiosity, did you try with the linker flag |
@kripken As anticipated, using the linker flag |
I tried using O3 instead of O2 and I tried adding I must admit that it makes our use case for WebAssembly compromised... Do you have any ideas of what I can do to improve this ? I guess I would need to rewrite that musl part of code with faster (and less accurate) algo ? Or is it a possibility that WebAssembly provides builtin implementation of those functions ? Thanks for your help Note that I will try using other math lib (like gml or gcem). I will post the bench result. (I also found this StackOverFlow discussion where I posted a link to this issue: |
@kripken it looks like the implementation of |
In fact the auto test I am doing is to execute multiple (a lot) of computation inside the same block I measure:
So there is a overhead for each call to cos/sin if using In the other hand the custom EM_ASM code wraps the total loop (so only one EM_ASM overhead for the whole test -> my goal being to measure JavaScript time):
|
Ah ok yeah that makes sense, thanks. |
@JeromeDesfieux Interesting, thanks for the info. Overall, I think the question is whether native code has some ability wasm doesn't. I mean, it's possible atan etc. in your libc is using some inline assembly CPU magic that can't be expressed in wasm. To get a more apples-to-apples comparison, maybe provide the atan etc. implementation in the source files you build. That is, don't rely on libc to implement it, but use some C++ library that you can compile to both native and wasm. (Just adding source files to compile+link should work - they will override the libc versions.) If there is still a difference, then perhaps there is something we can improve on compiling that library. If there isn't a difference, then you would be able to get good wasm performance by using that math library instead of the default math code in emscripten's libc (which is basically musl, and like all codebases has some compromises between size and speed etc. - so you may be able to do better, for your use case, with another math library). |
@JeromeDesfieux: When benchmarking Native vs. WebAssembly math functions keep in mind that in emscripten sometimes speed is traded for code size. See for e.g. #15544. Disclaimer: I don't know the implementation details for musl trigonometric functions. |
Thanks all for your help. I changed my autotest to making it more accurate (bigger dataset, more iterations) and to target only one trig function (cosine). I tried other implementations (gcem and V8 one) with very similar results --> WebAssembly is always [30-40]% slower than native C++. Note that I have other benchmarks about arithmetics where wasm is very very close to C++ native so I guess there is something very specific with those trigonometric functions. When I test the V8 algo, I copied the whole cosine code in my cpp so it is exactly the same code which compiles on native and wasm, but I still see this +30%. When looking at the code I can see that it uses a lot of binary operations (&, >>, <<, ...). Example of such macro used in cosine computation: #define GET_HIGH_WORD(i, d) \
do { \
uint64_t bits = base::bit_cast<uint64_t>(d); \
(i) = bits >> 32; \
} while (false)
// with bit_cast being:
template <class Dest, class Source>
inline Dest bit_cast(Source const& source) {
static_assert(sizeof(Dest) == sizeof(Source),
"source and dest must be same size");
Dest dest;
memcpy(&dest, &source, sizeof(dest));
return dest;
} |
I would expect bitwise operations like What wasm does (Otherwise, I think inspecting the machine code in both native and wasm is really the only way to understand a 30-40% slowdown. Comparing to other VMs might also be helpful to get context.) |
Use handwritten asm.js to speed up critical code using trigonometric functions. // Original benchmark code
(function() {
let benchmarkData = [];
const nbData = 1000;
for(let i=0; i<nbData; ++i)
benchmarkData[i] = i;
let out = 0;
const start = Date.now();
for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
{
const sinI = Math.sin(benchmarkData[i]);
for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
out += Math.atan2(sinI, Math.cos(benchmarkData[j]));
}
const end = Date.now();
console.log("[JS] (out is " + out + ")");
return end-start; // takes 135ms in Chrome on my machine
})(); Below is my handwritten asm.js. It's 3 times faster than the former Javascript. (function(console) {
"use strict";
const nbData = 1000;
var benchmarkData = new Float64Array(16384);
for(let i=0; i<nbData; ++i)
benchmarkData[i] = i;
function asmModuleInit(stdlib, foreign, heap) {
"use asm";
// shared variables
var fround = stdlib.Math.fround;
var sin = stdlib.Math.sin;
var cos = stdlib.Math.cos;
var atan2 = stdlib.Math.atan2;
var benchmarkData = new stdlib.Float64Array(heap);
// function declarations
function benchmark(len) {
len = len | 0;
var i=0, j=0;
var out=0.0, sinI=0.0;
len = len << 3;
for(i=0; (i|0) != (len|0); i=i+8|0) {
sinI = sin(+benchmarkData[i >> 3]);
for(j=0; (j|0) != (len|0); j=j+8|0)
out = out + atan2(sinI, cos(+benchmarkData[j >> 3]));
}
return out;
}
// export function
return { benchmark: benchmark };
}
const asmModule = asmModuleInit(window, {}, benchmarkData.buffer);
var len = nbData-1|0;
asmModule.benchmark(len); // warm up because chrome doesn't compile AOT
console.time("asm.js code");
asmModule.benchmark(len);
console.timeEnd("asm.js code"); // takes 44ms in Chrome on my machine
})(console); |
Emscripten version: 3.1.35
Tests done on Windows desktop i7 12th 32GB RAM (native compiler is MSVC 2022, brower Chrome for wasm tests)
Build configuration is RelWithDebInfo (O2) for native and wasm builds
I am doing some benchmark and I am very surprised with the results I have regarding the performance of the trigonometric functions (
math.h
).Basically I wrote a program doing a conbination of atan2, cos and sin that I execute on a test data.
I have the following results on my computer:
I tried many other approach for this test (including Catch2 benchmark and using a non-sorted dataset) and I have always very bad results with WebAssembly (approx. 50% slower than JS).
Is there is a known reason for that ? Can I improve ?
(Please find attached the source code)
Benchmark_trigo.zip
The text was updated successfully, but these errors were encountered: