simd_memcpy #4127


Merged
merged 2 commits into from
Feb 1, 2017

Conversation

juj
Collaborator

@juj juj commented Feb 25, 2016

Optimize memcpy() by reducing the three arithmetic ops in the loop body to two, by unrolling, and by using SIMD. Yields up to a 10x performance increase.

Threw this together quickly just for testing; not yet ready to merge. One remaining problem is that the added __deps directive for SIMD does not work correctly, and building currently emits

warning: unresolved symbol: SIMD_Int32x4_load
warning: unresolved symbol: SIMD_Int32x4_store

The code otherwise needs to reference these functions for them to be pulled in.

Other remaining work is to actually test correctness, benchmark the unaligned case, and extend the SIMD path to the unaligned case as well, since unaligned loads and stores are allowed by the SIMD.js spec.
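For illustration, here is a minimal sketch of the scalar part of the idea (my own sketch, not the PR's actual code): copy through a word-sized heap view, and track a single index with a constant src/dst delta so the unrolled loop body advances one counter instead of three.

```javascript
// Hedged sketch, assuming non-overlapping src/dst regions and 4-byte-aligned
// word indices into a Uint32Array heap view (as asm.js-style code would use).
// A single dst index plus a constant delta replaces separate src, dst and
// counter updates; the main loop is unrolled 4x (16 bytes per iteration).
function memcpyWords(heap32, dst, src, numWords) {
  const delta = src - dst;                 // src index = dst index + delta
  const unrolledEnd = dst + (numWords & ~3);
  const end = dst + numWords;
  while (dst < unrolledEnd) {              // 4 words per iteration
    heap32[dst] = heap32[dst + delta];
    heap32[dst + 1] = heap32[dst + 1 + delta];
    heap32[dst + 2] = heap32[dst + 2 + delta];
    heap32[dst + 3] = heap32[dst + 3 + delta];
    dst += 4;
  }
  while (dst < end) {                      // remaining 0-3 words
    heap32[dst] = heap32[dst + delta];
    dst += 1;
  }
}
```

The SIMD variant replaces the four word copies with a single Int32x4 load/store pair over the same indices; that is what the unresolved SIMD_Int32x4_load/store imports above are for.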

@juj
Collaborator Author

juj commented Feb 25, 2016

Initial perf results at http://clb.demon.fi/dump/simd_memcpy/results.html.

@juj juj mentioned this pull request Feb 29, 2016
@juj
Collaborator Author

juj commented Feb 1, 2017

Updated this to the latest, which resolves the SIMD import and adds a test and benchmarks. The results are as follows.

Before the new version:

test_memcpy_128b (test_benchmark.benchmark) ...
        clang: mean: 2.623 (+-0.063) secs  median: 2.578  range: 2.571-2.711  (noise: 2.402%)  (3 runs)
     sm-asmjs: mean: 6.992 (+-0.565) secs  median: 6.602  range: 6.447-7.771  (noise: 8.086%)  (3 runs)   Relative: 2.67 X slower
      sm-wasm: mean: 5.853 (+-0.595) secs  median: 5.436  range: 5.344-6.687  (noise: 10.16%)  (3 runs)   Relative: 2.23 X slower

test_memcpy_4k (test_benchmark.benchmark) ...
        clang: mean: 0.625 (+-0.038) secs  median: 0.598  range: 0.591-0.679  (noise: 6.147%)  (3 runs)
     sm-asmjs: mean: 4.524 (+-0.071) secs  median: 4.482  range: 4.436-4.608  (noise: 1.559%)  (3 runs)   Relative: 7.24 X slower
      sm-wasm: mean: 2.536 (+-0.139) secs  median: 2.458  range: 2.354-2.690  (noise: 5.463%)  (3 runs)   Relative: 4.05 X slower

test_memcpy_16k (test_benchmark.benchmark) ...
        clang: mean: 0.140 (+-0.026) secs  median: 0.123  range: 0.113-0.176  (noise: 18.55%)  (3 runs)
     sm-asmjs: mean: 0.908 (+-0.085) secs  median: 0.853  range: 0.808-1.017  (noise: 9.415%)  (3 runs)   Relative: 6.46 X slower
      sm-wasm: mean: 0.859 (+-0.035) secs  median: 0.839  range: 0.812-0.897  (noise: 4.108%)  (3 runs)   Relative: 6.11 X slower

test_memcpy_1mb (test_benchmark.benchmark) ...
        clang: mean: 1.449 (+-0.055) secs  median: 1.415  range: 1.386-1.519  (noise: 3.762%)  (3 runs)
     sm-asmjs: mean: 1.947 (+-0.159) secs  median: 1.834  range: 1.824-2.171  (noise: 8.165%)  (3 runs)   Relative: 1.34 X slower
      sm-wasm: mean: 1.798 (+-0.054) secs  median: 1.768  range: 1.727-1.856  (noise: 2.982%)  (3 runs)   Relative: 1.24 X slower

test_memcpy_16mb (test_benchmark.benchmark) ...
        clang: mean: 3.835 (+-0.083) secs  median: 3.778  range: 3.750-3.948  (noise: 2.176%)  (3 runs)
     sm-asmjs: mean: 3.862 (+-0.110) secs  median: 3.789  range: 3.748-4.010  (noise: 2.837%)  (3 runs)   Relative: 1.01 X slower
      sm-wasm: mean: 3.929 (+-0.016) secs  median: 3.918  range: 3.914-3.951  (noise: 0.404%)  (3 runs)   Relative: 1.02 X slower

After the new version:

test_memcpy_128b (test_benchmark.benchmark) ...
        clang: mean: 2.731 (+-0.205) secs  median: 2.598  range: 2.496-2.996  (noise: 7.524%)  (3 runs)
     sm-asmjs: mean: 7.433 (+-0.400) secs  median: 7.227  range: 6.891-7.845  (noise: 5.382%)  (3 runs)   Relative: 2.72 X slower
      sm-simd: mean: 6.720 (+-0.302) secs  median: 6.558  range: 6.316-7.044  (noise: 4.500%)  (3 runs)   Relative: 2.46 X slower
      sm-wasm: mean: 6.234 (+-0.439) secs  median: 5.924  range: 5.907-6.855  (noise: 7.042%)  (3 runs)   Relative: 2.28 X slower

test_memcpy_4k (test_benchmark.benchmark) ...
        clang: mean: 0.689 (+-0.031) secs  median: 0.677  range: 0.645-0.712  (noise: 4.491%)  (3 runs)
     sm-asmjs: mean: 1.431 (+-0.088) secs  median: 1.396  range: 1.307-1.502  (noise: 6.171%)  (3 runs)   Relative: 2.08 X slower
      sm-simd: mean: 1.011 (+-0.031) secs  median: 0.999  range: 0.968-1.036  (noise: 3.030%)  (3 runs)   Relative: 1.47 X slower
      sm-wasm: mean: 1.206 (+-0.050) secs  median: 1.176  range: 1.142-1.265  (noise: 4.166%)  (3 runs)   Relative: 1.75 X slower

test_memcpy_16k (test_benchmark.benchmark) ...
        clang: mean: 0.140 (+-0.002) secs  median: 0.139  range: 0.138-0.142  (noise: 1.342%)  (3 runs)
     sm-asmjs: mean: 0.735 (+-0.088) secs  median: 0.694  range: 0.612-0.817  (noise: 12.01%)  (3 runs)   Relative: 5.25 X slower
      sm-simd: mean: 0.366 (+-0.024) secs  median: 0.349  range: 0.342-0.399  (noise: 6.562%)  (3 runs)   Relative: 2.62 X slower
      sm-wasm: mean: 0.823 (+-0.023) secs  median: 0.807  range: 0.799-0.854  (noise: 2.810%)  (3 runs)   Relative: 5.89 X slower

test_memcpy_1mb (test_benchmark.benchmark) ...
        clang: mean: 1.363 (+-0.064) secs  median: 1.332  range: 1.275-1.426  (noise: 4.721%)  (3 runs)
     sm-asmjs: mean: 2.009 (+-0.129) secs  median: 1.935  range: 1.843-2.157  (noise: 6.429%)  (3 runs)   Relative: 1.47 X slower
      sm-simd: mean: 1.827 (+-0.155) secs  median: 1.741  range: 1.623-1.999  (noise: 8.499%)  (3 runs)   Relative: 1.34 X slower
      sm-wasm: mean: 1.755 (+-0.025) secs  median: 1.744  range: 1.721-1.778  (noise: 1.408%)  (3 runs)   Relative: 1.29 X slower

test_memcpy_16mb (test_benchmark.benchmark) ...
        clang: mean: 3.904 (+-0.056) secs  median: 3.868  range: 3.841-3.977  (noise: 1.434%)  (3 runs)
     sm-asmjs: mean: 4.010 (+-0.075) secs  median: 3.975  range: 3.907-4.081  (noise: 1.863%)  (3 runs)   Relative: 1.03 X slower
      sm-simd: mean: 4.029 (+-0.065) secs  median: 4.000  range: 3.938-4.087  (noise: 1.625%)  (3 runs)   Relative: 1.03 X slower
      sm-wasm: mean: 4.009 (+-0.080) secs  median: 3.971  range: 3.899-4.086  (noise: 1.986%)  (3 runs)   Relative: 1.03 X slower

We are still somewhat slow, especially in the ~4 KB to 16 KB copy range, but I am not sure how to improve the code any further. This should be good to merge.

@juj
Collaborator Author

juj commented Feb 1, 2017

Related to this, https://bugzilla.mozilla.org/show_bug.cgi?id=1254573 discusses whether adding a memcpy opcode directly in wasm would be a good idea.

@juj
Collaborator Author

juj commented Feb 1, 2017

Here's what the performance graph (MBytes/sec bandwidth) looks like for aligned memcpy after this PR: http://clb.demon.fi/emcc/memcpy_test/results.html. (Click on the labels to hide/show a particular workload.)

With SIMD we are about on par with native for copies <= 512 bytes and for copies >= 20480 bytes, but within that range native can be something like 3x faster than asm.js+SIMD.

In wasm (no SIMD), we are on par with native for copies <= 192 bytes and for copies >= 65536 bytes, with a similar kind of performance gap in between.

@juj
Collaborator Author

juj commented Feb 1, 2017

Another thing to note is that after the recent adoption of garbage-free WebGL 2 entry points, the HEAP8.set() in emscripten_memcpy_big is the single largest source of garbage in two applications I've recently been profiling.
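For context, the big-copy path in question looks roughly like the following (a simplified sketch, not the exact emscripten source); the short-lived view object allocated by subarray() on every call is the garbage being referred to:

```javascript
// Simplified sketch of the pattern under discussion, not the exact
// emscripten source. HEAPU8 stands in for the Emscripten heap view.
const HEAPU8 = new Uint8Array(1 << 16);

function emscripten_memcpy_big(dest, src, num) {
  // subarray() allocates a fresh typed-array view object per call; the
  // bytes themselves are not copied by subarray(), but the view object
  // is garbage that the GC must later collect.
  HEAPU8.set(HEAPU8.subarray(src, src + num), dest);
  return dest;
}
```

A copy loop indexing HEAPU8 directly would avoid the temporary view, at the cost of doing the copy in JS rather than in the engine's optimized set() path.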

@@ -599,7 +599,7 @@ def make_emulated_param(i):
   if metadata['simdUint32x4']:
     simdinttypes += ['Uint32x4']
     simdintfloatfuncs += ['fromUint32x4Bits']
-  if metadata['simdInt32x4']:
+  if metadata['simdInt32x4'] or settings['SIMD']: # Always import Int32x4 when building with -s SIMD=1, since memcpy is SIMD optimized.
Member

can simdInt32x4 be false but SIMD be true? (if not, then the if could be simplified to just check for SIMD?)

Collaborator Author

Yes, it can. The application itself may not use the Int32x4 SIMD type at all, giving SIMD=1 and simdInt32x4=0. The reverse is also possible (simdInt32x4=1 and SIMD=0), because simdInt32x4 is set when the compiled .bc code is detected to contain SIMD constructs (the .bc could have been compiled earlier with different flags), which is orthogonal to whether the user passed -s SIMD=1 on the command line.

@@ -626,7 +626,7 @@ def make_emulated_param(i):
   fundamentals = ['Math']
   fundamentals += ['Int8Array', 'Int16Array', 'Int32Array', 'Uint8Array', 'Uint16Array', 'Uint32Array', 'Float32Array', 'Float64Array']
   fundamentals += ['NaN', 'Infinity']
-  if metadata['simd']:
+  if metadata['simd'] or settings['SIMD']: # Always import SIMD when building with -s SIMD=1, since in that mode memcpy is SIMD optimized.
Member

same question here - doesn't one of those imply the other?

Collaborator Author

Same here: simd=1 indicates whether the input .bc code contained SIMD constructs at all, while SIMD=1 means the user wants to target SIMD now.

Member

Right, thanks, I forgot how those worked.

@@ -582,6 +583,31 @@ def output_parser(output):
       return 100.0/float(re.search('Unrolled Single Precision +([\d\.]+) Mflops', output).group(1))
     self.do_benchmark('linpack_float', open(path_from_root('tests', 'linpack.c')).read(), '''Unrolled Single Precision''', force_c=True, output_parser=output_parser, shared_args=['-DSP'])

+  def test_memcpy_128b(self):
Member

Please disable these by default (rename to zzz_test_*, since the runner looks for test_), unless you think we should constantly monitor them?

Collaborator Author

Ok. Let me keep the 16k copy case, because that is where we are the slowest.

@@ -2129,6 +2129,9 @@ def test_memcpy2(self):
   def test_memcpy3(self):
     self.do_run_in_out_file_test('tests', 'core', 'test_memcpy3')

+  def test_memcpy_alignment(self):
+    self.do_run(open(path_from_root('tests', 'test_memcpy.cpp'), 'r').read(), 'OK.')
Member

maybe name the cpp file test_memcpy_alignment.cpp to match the test name?

Collaborator Author

Check.

@kripken
Member

kripken commented Feb 1, 2017

lgtm with those things fixed.

…enchmarks where Emscripten is relatively good compared to native.
@juj juj merged commit 7a84ba0 into emscripten-core:incoming Feb 1, 2017