simd_memcpy #4127


Merged
merged 2 commits into from
Feb 1, 2017

Conversation

juj
Collaborator

@juj juj commented Feb 25, 2016

Optimize memcpy() by reducing the three arithmetic ops in the loop body to two, by unrolling, and by using SIMD. Yields up to a 10x performance increase.

Threw this together quickly just for testing; not yet ready to merge. One remaining problem is that the added __deps directive for SIMD does not work correctly, and building currently emits

warning: unresolved symbol: SIMD_Int32x4_load
warning: unresolved symbol: SIMD_Int32x4_store

The code otherwise needs to reference these functions for them to be pulled in.

Other remaining work is to actually test correctness, benchmark the unaligned case, and extend the SIMD path to the unaligned case as well, since unaligned loads and stores are allowed by the SIMD.js spec.
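For illustration, here is a minimal sketch of the scalar part of the idea (my own sketch, not the PR's actual code): copy through a word-sized heap view, and track a single index with a constant src/dst delta so the unrolled loop body advances one counter instead of three.

```javascript
// Hedged sketch, assuming non-overlapping src/dst regions and 4-byte-aligned
// word indices into a Uint32Array heap view (as asm.js-style code would use).
// A single dst index plus a constant delta replaces separate src, dst and
// counter updates; the main loop is unrolled 4x (16 bytes per iteration).
function memcpyWords(heap32, dst, src, numWords) {
  const delta = src - dst;                 // src index = dst index + delta
  const unrolledEnd = dst + (numWords & ~3);
  const end = dst + numWords;
  while (dst < unrolledEnd) {              // 4 words per iteration
    heap32[dst] = heap32[dst + delta];
    heap32[dst + 1] = heap32[dst + 1 + delta];
    heap32[dst + 2] = heap32[dst + 2 + delta];
    heap32[dst + 3] = heap32[dst + 3 + delta];
    dst += 4;
  }
  while (dst < end) {                      // remaining 0-3 words
    heap32[dst] = heap32[dst + delta];
    dst += 1;
  }
}
```

The SIMD variant replaces the four word copies with a single Int32x4 load/store pair over the same indices; that is what the unresolved SIMD_Int32x4_load/store imports above are for.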

@juj
Collaborator Author

juj commented Feb 25, 2016

Initial perf results at http://clb.demon.fi/dump/simd_memcpy/results.html.

@juj juj mentioned this pull request Feb 29, 2016
@juj
Collaborator Author

juj commented Feb 1, 2017

Updated this to the latest, which resolves the SIMD import and adds a test and benchmarks. The results are as follows.

Before the new version:

test_memcpy_128b (test_benchmark.benchmark) ...
        clang: mean: 2.623 (+-0.063) secs  median: 2.578  range: 2.571-2.711  (noise: 2.402%)  (3 runs)
     sm-asmjs: mean: 6.992 (+-0.565) secs  median: 6.602  range: 6.447-7.771  (noise: 8.086%)  (3 runs)   Relative: 2.67 X slower
      sm-wasm: mean: 5.853 (+-0.595) secs  median: 5.436  range: 5.344-6.687  (noise: 10.16%)  (3 runs)   Relative: 2.23 X slower

test_memcpy_4k (test_benchmark.benchmark) ...
        clang: mean: 0.625 (+-0.038) secs  median: 0.598  range: 0.591-0.679  (noise: 6.147%)  (3 runs)
     sm-asmjs: mean: 4.524 (+-0.071) secs  median: 4.482  range: 4.436-4.608  (noise: 1.559%)  (3 runs)   Relative: 7.24 X slower
      sm-wasm: mean: 2.536 (+-0.139) secs  median: 2.458  range: 2.354-2.690  (noise: 5.463%)  (3 runs)   Relative: 4.05 X slower

test_memcpy_16k (test_benchmark.benchmark) ...
        clang: mean: 0.140 (+-0.026) secs  median: 0.123  range: 0.113-0.176  (noise: 18.55%)  (3 runs)
     sm-asmjs: mean: 0.908 (+-0.085) secs  median: 0.853  range: 0.808-1.017  (noise: 9.415%)  (3 runs)   Relative: 6.46 X slower
      sm-wasm: mean: 0.859 (+-0.035) secs  median: 0.839  range: 0.812-0.897  (noise: 4.108%)  (3 runs)   Relative: 6.11 X slower

test_memcpy_1mb (test_benchmark.benchmark) ...
        clang: mean: 1.449 (+-0.055) secs  median: 1.415  range: 1.386-1.519  (noise: 3.762%)  (3 runs)
     sm-asmjs: mean: 1.947 (+-0.159) secs  median: 1.834  range: 1.824-2.171  (noise: 8.165%)  (3 runs)   Relative: 1.34 X slower
      sm-wasm: mean: 1.798 (+-0.054) secs  median: 1.768  range: 1.727-1.856  (noise: 2.982%)  (3 runs)   Relative: 1.24 X slower

test_memcpy_16mb (test_benchmark.benchmark) ...
        clang: mean: 3.835 (+-0.083) secs  median: 3.778  range: 3.750-3.948  (noise: 2.176%)  (3 runs)
     sm-asmjs: mean: 3.862 (+-0.110) secs  median: 3.789  range: 3.748-4.010  (noise: 2.837%)  (3 runs)   Relative: 1.01 X slower
      sm-wasm: mean: 3.929 (+-0.016) secs  median: 3.918  range: 3.914-3.951  (noise: 0.404%)  (3 runs)   Relative: 1.02 X slower

After the new version:

test_memcpy_128b (test_benchmark.benchmark) ...
        clang: mean: 2.731 (+-0.205) secs  median: 2.598  range: 2.496-2.996  (noise: 7.524%)  (3 runs)
     sm-asmjs: mean: 7.433 (+-0.400) secs  median: 7.227  range: 6.891-7.845  (noise: 5.382%)  (3 runs)   Relative: 2.72 X slower
      sm-simd: mean: 6.720 (+-0.302) secs  median: 6.558  range: 6.316-7.044  (noise: 4.500%)  (3 runs)   Relative: 2.46 X slower
      sm-wasm: mean: 6.234 (+-0.439) secs  median: 5.924  range: 5.907-6.855  (noise: 7.042%)  (3 runs)   Relative: 2.28 X slower

test_memcpy_4k (test_benchmark.benchmark) ...
        clang: mean: 0.689 (+-0.031) secs  median: 0.677  range: 0.645-0.712  (noise: 4.491%)  (3 runs)
     sm-asmjs: mean: 1.431 (+-0.088) secs  median: 1.396  range: 1.307-1.502  (noise: 6.171%)  (3 runs)   Relative: 2.08 X slower
      sm-simd: mean: 1.011 (+-0.031) secs  median: 0.999  range: 0.968-1.036  (noise: 3.030%)  (3 runs)   Relative: 1.47 X slower
      sm-wasm: mean: 1.206 (+-0.050) secs  median: 1.176  range: 1.142-1.265  (noise: 4.166%)  (3 runs)   Relative: 1.75 X slower

test_memcpy_16k (test_benchmark.benchmark) ...
        clang: mean: 0.140 (+-0.002) secs  median: 0.139  range: 0.138-0.142  (noise: 1.342%)  (3 runs)
     sm-asmjs: mean: 0.735 (+-0.088) secs  median: 0.694  range: 0.612-0.817  (noise: 12.01%)  (3 runs)   Relative: 5.25 X slower
      sm-simd: mean: 0.366 (+-0.024) secs  median: 0.349  range: 0.342-0.399  (noise: 6.562%)  (3 runs)   Relative: 2.62 X slower
      sm-wasm: mean: 0.823 (+-0.023) secs  median: 0.807  range: 0.799-0.854  (noise: 2.810%)  (3 runs)   Relative: 5.89 X slower

test_memcpy_1mb (test_benchmark.benchmark) ...
        clang: mean: 1.363 (+-0.064) secs  median: 1.332  range: 1.275-1.426  (noise: 4.721%)  (3 runs)
     sm-asmjs: mean: 2.009 (+-0.129) secs  median: 1.935  range: 1.843-2.157  (noise: 6.429%)  (3 runs)   Relative: 1.47 X slower
      sm-simd: mean: 1.827 (+-0.155) secs  median: 1.741  range: 1.623-1.999  (noise: 8.499%)  (3 runs)   Relative: 1.34 X slower
      sm-wasm: mean: 1.755 (+-0.025) secs  median: 1.744  range: 1.721-1.778  (noise: 1.408%)  (3 runs)   Relative: 1.29 X slower

test_memcpy_16mb (test_benchmark.benchmark) ...
        clang: mean: 3.904 (+-0.056) secs  median: 3.868  range: 3.841-3.977  (noise: 1.434%)  (3 runs)
     sm-asmjs: mean: 4.010 (+-0.075) secs  median: 3.975  range: 3.907-4.081  (noise: 1.863%)  (3 runs)   Relative: 1.03 X slower
      sm-simd: mean: 4.029 (+-0.065) secs  median: 4.000  range: 3.938-4.087  (noise: 1.625%)  (3 runs)   Relative: 1.03 X slower
      sm-wasm: mean: 4.009 (+-0.080) secs  median: 3.971  range: 3.899-4.086  (noise: 1.986%)  (3 runs)   Relative: 1.03 X slower

We are still somewhat slow, especially in the ~4 KB to 16 KB copy range, but I am not sure how to improve the code any further. This should be good to merge.

@juj
Collaborator Author

juj commented Feb 1, 2017

Related to this, https://bugzilla.mozilla.org/show_bug.cgi?id=1254573 discusses whether adding a memcpy opcode directly in wasm would be a good idea.

@juj
Collaborator Author

juj commented Feb 1, 2017

Here's what the performance graph (MBytes/sec bandwidth) looks like for aligned memcpy after this PR: http://clb.demon.fi/emcc/memcpy_test/results.html. (Click on the labels to hide/show a particular workload.)

With SIMD we are about on par with native for copies <= 512 bytes and for copies >= 20480 bytes, but within that range native can be something like 3x faster than asm.js+SIMD.

In wasm (no SIMD), we are on par with native for copies <= 192 bytes and for copies >= 65536 bytes, with a similar kind of performance gap in between.

@juj
Collaborator Author

juj commented Feb 1, 2017

Another thing to note is that after the recent adoption of garbage-free WebGL 2 entry points, the HEAP8.set() in emscripten_memcpy_big is the single largest source of garbage in two applications I've recently been profiling.
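For context, the big-copy path in question looks roughly like the following (a simplified sketch, not the exact emscripten source); the short-lived view object allocated by subarray() on every call is the garbage being referred to:

```javascript
// Simplified sketch of the pattern under discussion, not the exact
// emscripten source. HEAPU8 stands in for the Emscripten heap view.
const HEAPU8 = new Uint8Array(1 << 16);

function emscripten_memcpy_big(dest, src, num) {
  // subarray() allocates a fresh typed-array view object per call; the
  // bytes themselves are not copied by subarray(), but the view object
  // is garbage that the GC must later collect.
  HEAPU8.set(HEAPU8.subarray(src, src + num), dest);
  return dest;
}
```

A copy loop indexing HEAPU8 directly would avoid the temporary view, at the cost of doing the copy in JS rather than in the engine's optimized set() path.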

@@ -599,7 +599,7 @@ def make_emulated_param(i):
   if metadata['simdUint32x4']:
     simdinttypes += ['Uint32x4']
     simdintfloatfuncs += ['fromUint32x4Bits']
-  if metadata['simdInt32x4']:
+  if metadata['simdInt32x4'] or settings['SIMD']: # Always import Int32x4 when building with -s SIMD=1, since memcpy is SIMD optimized.
Member

can simdInt32x4 be false but SIMD be true? (if not, then the if could be simplified to just check for SIMD?)

Collaborator Author

Yes, it can. The application itself may not use the Int32x4 SIMD type at all, giving SIMD=1 and simdInt32x4=0. The reverse is also possible (simdInt32x4=1 and SIMD=0), because simdInt32x4 is set when the compiled .bc code is detected to contain SIMD constructs (the .bc could have been compiled earlier with different flags), which is orthogonal to whether the user passed -s SIMD=1 on the command line.

@@ -626,7 +626,7 @@ def make_emulated_param(i):
   fundamentals = ['Math']
   fundamentals += ['Int8Array', 'Int16Array', 'Int32Array', 'Uint8Array', 'Uint16Array', 'Uint32Array', 'Float32Array', 'Float64Array']
   fundamentals += ['NaN', 'Infinity']
-  if metadata['simd']:
+  if metadata['simd'] or settings['SIMD']: # Always import SIMD when building with -s SIMD=1, since in that mode memcpy is SIMD optimized.
Member

same question here - doesn't one of those imply the other?

Collaborator Author

Same here: simd=1 indicates whether the input .bc code contained SIMD constructs at all, while SIMD=1 means the user wants to target SIMD now.

Member

Right, thanks, I forgot how those worked.

@@ -582,6 +583,31 @@ def output_parser(output):
       return 100.0/float(re.search('Unrolled Single Precision +([\d\.]+) Mflops', output).group(1))
     self.do_benchmark('linpack_float', open(path_from_root('tests', 'linpack.c')).read(), '''Unrolled Single Precision''', force_c=True, output_parser=output_parser, shared_args=['-DSP'])

+  def test_memcpy_128b(self):
Member

Please disable these by default (rename to zzz_test_*, since the runner looks for test_), unless you think we should constantly monitor them?

Collaborator Author

Ok. Let me keep the 16k copy case, because that is where we are the slowest.

@@ -2129,6 +2129,9 @@ def test_memcpy2(self):
   def test_memcpy3(self):
     self.do_run_in_out_file_test('tests', 'core', 'test_memcpy3')

+  def test_memcpy_alignment(self):
+    self.do_run(open(path_from_root('tests', 'test_memcpy.cpp'), 'r').read(), 'OK.')
Member

maybe name the cpp file test_memcpy_alignment.cpp to match the test name?

Collaborator Author

Check.

@kripken
Member

kripken commented Feb 1, 2017

lgtm with those things fixed.

…enchmarks where Emscripten is relatively good compared to native.
@juj juj merged commit 7a84ba0 into emscripten-core:incoming Feb 1, 2017