Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very poor performance of trigonometric functions #19284

Open
JeromeDesfieux opened this issue May 3, 2023 · 14 comments
Open

Very poor performance of trigonometric functions #19284

JeromeDesfieux opened this issue May 3, 2023 · 14 comments

Comments

@JeromeDesfieux
Copy link
Contributor

JeromeDesfieux commented May 3, 2023

Emscripten version: 3.1.35
Tests done on Windows desktop i7 12th 32GB RAM (native compiler is MSVC 2022, brower Chrome for wasm tests)
Build configuration is RelWithDebInfo (O2) for native and wasm builds

I am doing some benchmark and I am very surprised with the results I have regarding the performance of the trigonometric functions (math.h).

Basically I wrote a program doing a conbination of atan2, cos and sin that I execute on a test data.
I have the following results on my computer:

  • Native C++: 1.7 ms
  • WebAssembly: 3 ms
  • JavaScript: 2 ms

I tried many other approach for this test (including Catch2 benchmark and using a non-sorted dataset) and I have always very bad results with WebAssembly (approx. 50% slower than JS).

Is there is a known reason for that ? Can I improve ?

(Please find attached the source code)
Benchmark_trigo.zip

@sbc100
Copy link
Collaborator

sbc100 commented May 3, 2023

When you say JavaScript: 2 ms are you referring to the output of emcc -sWASM=0, or some other implementation written directly in JS?

@JeromeDesfieux
Copy link
Contributor Author

It is a custom js code inside a EM_ASM_ block which is as much close as possible to the c++ one.
If my understanding is good, that code is executed as-is by the JavaScript engine. Is it?

@sbc100
Copy link
Collaborator

sbc100 commented May 3, 2023

Yes, although the difference here is that JavaScript has trig functions builtin (e.g. Math.atan2), but IIRC WebAssembly does not. So you your C/C++ code will call into atan2() code which is implemented in userspace in terms of lower level math function: https://github.com/emscripten-core/emscripten/blob/main/system/lib/libc/musl/src/math/atan.c
Without investigating further I would guess that this is part of the reason for the differences you are seeing.

@kripken
Copy link
Member

kripken commented May 3, 2023

@JeromeDesfieux Out of curiosity, did you try with the linker flag -sJS_MATH ? That makes the wasm use JS's math functions. The overhead of calling out to JS might make it not worth it (this option is usually for code size), but it's worth comparing.

@JeromeDesfieux
Copy link
Contributor Author

@kripken As anticipated, using the linker flag -sJS_MATH gives worst results (WASM is then 3 times slower than JS).

@JeromeDesfieux
Copy link
Contributor Author

JeromeDesfieux commented May 4, 2023

I tried using O3 instead of O2 and I tried adding -msimd128 compiler flag without noticeable improvments.

I must admit that it makes our use case for WebAssembly compromised...
Our application is a graphic app which is doing a lot of trigonomy in framerate. Most of this trig computations are done on CPU during a kind of pre-culling phase that we can't push to the GPU...

Do you have any ideas of what I can do to improve this ? I guess I would need to rewrite that musl part of code with faster (and less accurate) algo ? Or is it a possibility that WebAssembly provides builtin implementation of those functions ?

Thanks for your help

Note that I will try using other math lib (like gml or gcem). I will post the bench result.

(I also found this StackOverFlow discussion where I posted a link to this issue:
https://stackoverflow.com/a/76171112/19077851)

@dschuff
Copy link
Member

dschuff commented May 5, 2023

@kripken it looks like the implementation of -sJS_MATH just calls directly out to the relevant JS math functions via EM_JS, so it seems very odd that it's noticeably worse than custom EM_ASM that does the same thing. That seems worth investigating, since it would seem like that's exactly what we'd want in this situation.

@JeromeDesfieux
Copy link
Contributor Author

JeromeDesfieux commented May 5, 2023

In fact the auto test I am doing is to execute multiple (a lot) of computation inside the same block I measure:

        auto start = std::chrono::high_resolution_clock::now();
        double out{0.};
        for(auto i=0 ; i < benchmarkData.size()-1 ; i+=1)
        {
            const double sinI = sin(benchmarkData[i]);
            for(auto j=0 ; j < benchmarkData.size()-1 ; j+=1)
                out += sinI * cos(benchmarkData[j]);
        }
        auto end = std::chrono::high_resolution_clock::now();

So there is a overhead for each call to cos/sin if using -sJS_MATH.

In the other hand the custom EM_ASM code wraps the total loop (so only one EM_ASM overhead for the whole test -> my goal being to measure JavaScript time):

        double time_taken = EM_ASM_DOUBLE(({
                let benchmarkData = [];
                const nbData = 10000;
                for(let i=0; i<nbData; ++i)
                    benchmarkData[i] = i;

                let out = 0;
                const start = Date.now();
                for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
                {
                    const sinI = Math.sin(benchmarkData[i]);
                    for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
                        out += sinI * Math.cos(benchmarkData[j]);
                }
                const end = Date.now();
                console.log("[JS] (out is " + out + ")");

                return end-start;
            }));

@dschuff
Copy link
Member

dschuff commented May 5, 2023

Ah ok yeah that makes sense, thanks.

@kripken
Copy link
Member

kripken commented May 5, 2023

@JeromeDesfieux Interesting, thanks for the info.

Overall, I think the question is whether native code has some ability wasm doesn't. I mean, it's possible atan etc. in your libc is using some inline assembly CPU magic that can't be expressed in wasm. To get a more apples-to-apples comparison, maybe provide the atan etc. implementation in the source files you build. That is, don't rely on libc to implement it, but use some C++ library that you can compile to both native and wasm. (Just adding source files to compile+link should work - they will override the libc versions.)

If there is still a difference, then perhaps there is something we can improve on compiling that library. If there isn't a difference, then you would be able to get good wasm performance by using that math library instead of the default math code in emscripten's libc (which is basically musl, and like all codebases has some compromises between size and speed etc. - so you may be able to do better, for your use case, with another math library).

@gl84
Copy link
Contributor

gl84 commented May 7, 2023

@JeromeDesfieux: When benchmarking Native vs. WebAssembly math functions keep in mind that in emscripten sometimes speed is traded for code size.

See for e.g. #15544.

Disclaimer: I don't know the implementation details for musl trigonometric functions.

@JeromeDesfieux
Copy link
Contributor Author

Thanks all for your help.

I changed my autotest to making it more accurate (bigger dataset, more iterations) and to target only one trig function (cosine).

I tried other implementations (gcem and V8 one) with very similar results --> WebAssembly is always [30-40]% slower than native C++. Note that I have other benchmarks about arithmetics where wasm is very very close to C++ native so I guess there is something very specific with those trigonometric functions.

When I test the V8 algo, I copied the whole cosine code in my cpp so it is exactly the same code which compiles on native and wasm, but I still see this +30%. When looking at the code I can see that it uses a lot of binary operations (&, >>, <<, ...).
Can those operators be the source of the slowness ?

Example of such macro used in cosine computation:

#define GET_HIGH_WORD(i, d)                      \  
  do {                                           \  
    uint64_t bits = base::bit_cast<uint64_t>(d); \  
    (i) = bits >> 32;                            \  
  } while (false)

// with bit_cast being:
template <class Dest, class Source>
inline Dest bit_cast(Source const& source) {
    static_assert(sizeof(Dest) == sizeof(Source),
                  "source and dest must be same size");
    Dest dest;
    memcpy(&dest, &source, sizeof(dest));
    return dest;
}

@kripken
Copy link
Member

kripken commented May 9, 2023

I would expect bitwise operations like << to be at full speed in wasm. It's things like memory access and calls, things that require sandboxing, that can be slower.

What wasm does base::bit_cast lower into?

(Otherwise, I think inspecting the machine code in both native and wasm is really the only way to understand a 30-40% slowdown. Comparing to other VMs might also be helpful to get context.)

@anonyco
Copy link

anonyco commented Nov 18, 2023

Use handwritten asm.js to speed up critical code using trigonometric functions.

// Original benchmark code
(function() {
	let benchmarkData = [];
	const nbData = 1000;
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	let out = 0;
	const start = Date.now();
	for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
	{
		const sinI = Math.sin(benchmarkData[i]);
		for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
			out += Math.atan2(sinI, Math.cos(benchmarkData[j]));
	}
	const end = Date.now();
	console.log("[JS] (out is " + out + ")");

	return end-start; // takes 135ms in Chrome on my machine
})();

Below is my handwritten asm.js. It's 3 times faster than the former Javascript.

(function(console) {
	"use strict";
	
	const nbData = 1000;
	var benchmarkData = new Float64Array(16384);
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	function asmModuleInit(stdlib, foreign, heap) {
		"use asm";
		// shared variables
		var fround = stdlib.Math.fround;
		var sin = stdlib.Math.sin;
		var cos = stdlib.Math.cos;
		var atan2 = stdlib.Math.atan2;
		var benchmarkData = new stdlib.Float64Array(heap);
		
		// function declarations
		function benchmark(len) {
			len = len | 0;
			var i=0, j=0;
			var out=0.0, sinI=0.0;
			len = len << 3;
			for(i=0; (i|0) != (len|0); i=i+8|0) {
				sinI = sin(+benchmarkData[i >> 3]);
				for(j=0; (j|0) != (len|0); j=j+8|0)
					out = out + atan2(sinI, cos(+benchmarkData[j >> 3]));
			}
			return out;
		}
		
		// export function
		return { benchmark: benchmark };
	}
	const asmModule = asmModuleInit(window, {}, benchmarkData.buffer);

	var len = nbData-1|0;
	asmModule.benchmark(len); // warm up because chrome doesn't compile AOT 
	
	console.time("asm.js code");
	asmModule.benchmark(len);
	console.timeEnd("asm.js code"); // takes 44ms in Chrome on my machine
})(console);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants