-
Notifications
You must be signed in to change notification settings - Fork 3.4k
simd_memcpy #4127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simd_memcpy #4127
Conversation
Initial perf results at http://clb.demon.fi/dump/simd_memcpy/results.html. |
…o, by unrolling, and by using SIMD.
Updated this to latest, which resolves the SIMD import, and adds a test and benchmarks. The results are as follow. Before the new version:
After the new version:
We still have some slowness especially in the ~4k - 16k bytes copy range, but not sure how to improve the code any further. This should be good to merge. |
Related to this, https://bugzilla.mozilla.org/show_bug.cgi?id=1254573 discusses whether adding a memcpy opcode directly in wasm would be a good idea. |
Here's how the performance graph (MBytes/sec bandwidth) looks like on aligned memcpy loads after the PR: http://clb.demon.fi/emcc/memcpy_test/results.html. (Click on the labels to hide/show particular workload). With SIMD we are about on par with native on copies of <= 512 bytes, and on copies >= 20480 bytes, but in between this range native performance can be anything like 3x faster than asm.js+SIMD. In Wasm (no SIMD), we are on par with native on copies of <= 192 bytes, and on copies >= 65536 bytes, and have a similar kind of performance gap between this range. |
Another thing to note is that after the recent adoption of garbage free WebGL 2 entry points, the HEAP8.set() from emscripten_memcpy_big is the single largest source of garbage in two applications I've recently been profiling. |
@@ -599,7 +599,7 @@ def make_emulated_param(i): | |||
if metadata['simdUint32x4']: | |||
simdinttypes += ['Uint32x4'] | |||
simdintfloatfuncs += ['fromUint32x4Bits'] | |||
if metadata['simdInt32x4']: | |||
if metadata['simdInt32x4'] or settings['SIMD']: # Always import Int32x4 when building with -s SIMD=1, since memcpy is SIMD optimized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can simdInt32x4 be false but SIMD be true? (if not, then the if could be simplified to just check for SIMD?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can. It is possible that the application itself does not use Int32x4 SIMD type at all so SIMD=1 and simdInt32x4=0. Also it is possible to be the other way around, simdInt32x4=1 and SIMD=0, because simdInt32x4 is set to one when compiled .bc code is detected to contain SIMD constructs (the .bc could have been compiled earlier with different flags), but that is orthogonal to what the user set on the command line for -s SIMD=1.
@@ -626,7 +626,7 @@ def make_emulated_param(i): | |||
fundamentals = ['Math'] | |||
fundamentals += ['Int8Array', 'Int16Array', 'Int32Array', 'Uint8Array', 'Uint16Array', 'Uint32Array', 'Float32Array', 'Float64Array'] | |||
fundamentals += ['NaN', 'Infinity'] | |||
if metadata['simd']: | |||
if metadata['simd'] or settings['SIMD']: # Always import SIMD when building with -s SIMD=1, since in that mode memcpy is SIMD optimized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here - doesn't one of those imply the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the same, simd=1 refers to whether the input .bc code contained SIMD constructs in general or not, and SIMD=1 means whether user wants to target SIMD now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks, I forgot how those worked.
tests/test_benchmark.py
Outdated
@@ -582,6 +583,31 @@ def output_parser(output): | |||
return 100.0/float(re.search('Unrolled Single Precision +([\d\.]+) Mflops', output).group(1)) | |||
self.do_benchmark('linpack_float', open(path_from_root('tests', 'linpack.c')).read(), '''Unrolled Single Precision''', force_c=True, output_parser=output_parser, shared_args=['-DSP']) | |||
|
|||
def test_memcpy_128b(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please disable these by default (zzz_test_*
, as it looks for test_
). unless you think we should constantly monitor these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let me keep the 16k copy case, because that is where we are the slowest.
tests/test_core.py
Outdated
@@ -2129,6 +2129,9 @@ def test_memcpy2(self): | |||
def test_memcpy3(self): | |||
self.do_run_in_out_file_test('tests', 'core', 'test_memcpy3') | |||
|
|||
def test_memcpy_alignment(self): | |||
self.do_run(open(path_from_root('tests', 'test_memcpy.cpp'), 'r').read(), 'OK.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe name the cpp file test_memcpy_alignment.cpp
to match the test name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check.
lgtm with those things fixed. |
…enchmarks where Emscripten is relatively good compared to native.
Optimize memcpy() by reducing three arithmetic ops in loop body to two, by unrolling, and by using SIMD. Yields up to 10x performance increase.
Threw this together quickly just for testing. Not yet to merge. A remaining problem is that the added __deps directive to SIMD does not work correctly, but building it gives
warning: unresolved symbol: SIMD_Int32x4_load
warning: unresolved symbol: SIMD_Int32x4_store
and the code needs to otherwise refer to these functions in order for them to be pulled in.
Other remaining work is to actually test correctness, benchmark the unaligned case, and add to use SIMD for the unaligned case as well, since that is allowed by the SIMD.js spec.