
Simd swap #44578


Closed
wants to merge 25 commits into from

Conversation

lucshi

@lucshi lucshi commented Sep 9, 2022

Used SIMD to improve buffer swap performance for the following APIs:

buffer.swap16()
buffer.swap32()
buffer.swap64()

According to the Node.js buffer API design, only buffers larger than 128 bytes go into the C++-level implementation, so this optimization only applies in that case. The optimization takes effect on x86 platforms that support AVX-512 VBMI instructions, e.g. Cannon Lake and Ice Lake, including Ice Lake-based AWS EC2 instances.

The code is placed in the deps/ folder since it is platform dependent.

To achieve the best performance, it adopts an AVX-512 permute instruction and uses inlining instead of a function call, because the SIMD logic is very short.

The Node.js buffer swap benchmark shows large improvements after applying this optimization, varying from ~30% to ~600%.

node-benchmark-compare 
                                                                          confidence improvement accuracy (*)     (**)    (***)
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='false'        ***     70.56 %       ±3.32%   ±4.56%   ±6.22%
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='true'         ***     60.85 %       ±3.78%   ±5.19%   ±7.09%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='false'        ***    254.48 %       ±3.07%   ±4.40%   ±6.44%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='true'         ***    254.36 %       ±2.71%   ±3.88%   ±5.68%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='false'        ***    132.66 %       ±3.45%   ±4.95%   ±7.27%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='true'         ***    135.83 %       ±4.25%   ±6.09%   ±8.91%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='false'        ***     93.81 %       ±2.70%   ±3.71%   ±5.06%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='true'         ***     85.19 %       ±3.74%   ±5.26%   ±7.45%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='false'        ***    346.51 %       ±4.85%   ±6.90%   ±9.98%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='true'         ***    334.75 %       ±7.03%  ±10.10%  ±14.84%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='false'        ***    200.37 %       ±3.26%   ±4.68%   ±6.88%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='true'         ***    202.17 %       ±4.13%   ±5.94%   ±8.73%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='false'         ***     32.23 %       ±6.39%   ±9.04%  ±12.94%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='true'          ***     27.76 %       ±3.85%   ±5.31%   ±7.29%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='false'         ***     82.32 %       ±4.38%   ±6.08%   ±8.44%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='true'          ***     82.85 %      ±12.90%  ±18.01%  ±25.27%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='false'         ***     42.91 %       ±2.00%   ±2.74%   ±3.75%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='true'          ***     44.30 %       ±2.94%   ±4.03%   ±5.50%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='false'                  -0.01 %       ±0.35%   ±0.48%   ±0.66%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='true'                   -0.05 %       ±0.19%   ±0.26%   ±0.36%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='false'                   0.38 %       ±1.76%   ±2.42%   ±3.30%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='true'                    0.72 %       ±2.06%   ±2.82%   ±3.85%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='false'                   0.44 %       ±1.00%   ±1.37%   ±1.86%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='true'                   -0.36 %       ±0.48%   ±0.66%   ±0.90%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='false'         ***     62.04 %       ±5.14%   ±7.21%  ±10.20%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='true'          ***     50.11 %       ±4.66%   ±6.43%   ±8.86%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='false'         ***    185.22 %       ±5.40%   ±7.75%  ±11.38%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='true'          ***    186.12 %       ±2.80%   ±3.94%   ±5.59%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='false'         ***    107.24 %       ±2.48%   ±3.56%   ±5.24%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='true'          ***    109.04 %       ±5.19%   ±7.45%  ±10.96%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='false'        ***    104.20 %       ±2.08%   ±2.85%   ±3.89%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='true'         ***    176.63 %      ±58.59%  ±84.16% ±123.81%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='false'        ***    450.33 %       ±4.58%   ±6.57%   ±9.65%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='true'         ***    618.62 %     ±161.33% ±231.77% ±340.97%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='false'        ***    264.54 %       ±3.58%   ±5.15%   ±7.58%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='true'         ***    384.05 %     ±110.97% ±159.42% ±234.52%

lucshi added 11 commits September 4, 2022 09:07
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/gyp

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Sep 9, 2022
Contributor

@mscdex mscdex left a comment


I'm -1 on this for two main reasons:

  1. It's limited to a single, new SIMD instruction that isn't widespread yet
  2. I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here

Additional notes:

  • It's missing a license and the changes in tools/ for including said license, if this is a true 3rd-party dependency
  • Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)

@lucshi
Author

lucshi commented Sep 9, 2022

I'm -1 on this for two main reasons:

  1. It's limited to a single, new SIMD instruction that isn't widespread yet
  2. I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here

Additional notes:

  • It's missing a license and the changes in tools/ for including said license, if this is a true 3rd-party dependency
  • Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)

The code is not from a 3rd party like base64 or zlib, so there is no 3rd-party license. This patch optimizes util.h itself to make byte swapping faster. Byte swapping is useful in endian conversion and networking scenarios.

For your first comment, I'm not sure if the Node.js project has criteria for platform-dependent code. Actually AVX-512 VBMI is an "old" instruction set, introduced four years ago, and is widespread nowadays, e.g. in AWS cloud instances. On the other hand, this is a performance optimization that would not degrade performance on platforms that do not support AVX-512 VBMI. If you'd advise supporting more SIMD instruction sets like AVX-512F or AVX2, I think it's OK to add them later.

@mscdex
Contributor

mscdex commented Sep 9, 2022

Byte swapping is useful in endian conversion

Right, but I don't see that as a common operation that would be a bottleneck in node applications.

@lucshi
Author

lucshi commented Sep 9, 2022

Byte swapping is useful in endian conversion

Right, but I don't see that as a common operation that would be a bottleneck in node applications.

Are there criteria for what a performance PR should be? In general I'm looking for parallel programming opportunities in every area to make Node.js run at its best performance.

@mscdex
Contributor

mscdex commented Sep 9, 2022

Are there criteria for what a performance PR should be? In general I'm looking for parallel programming opportunities in every area to make Node.js run at its best performance.

Here are a few general suggestions:

  • Look for any performance-related TODOs throughout the node codebase
  • Consider submitting performance improvements for node's existing dependencies -- not only will that potentially benefit node, but perhaps other users as well
  • Research common/popular use cases for node, break down the potential bottlenecks (e.g. network I/O, thread pool consumption, CPU usage, JS<->C++ boundary crossing, etc.) that can arise from those use cases, and see if any of them can be improved by changing the relevant parts in node

Offhand, if I had to pick a specific feature that would be worth optimizing, it would be supporting accelerated base64 encoding with the base64url alphabet, or accelerated base64 decoding. Arguably these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of those you may want to start there.

@lucshi
Author

lucshi commented Sep 9, 2022

Offhand, if I had to pick a specific feature that would be worth optimizing, it would be supporting accelerated base64 encoding with the base64url alphabet, or accelerated base64 decoding. Arguably these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of those you may want to start there.

I already submitted a PR to improve base64 encoding in the new 3rd-party base64-simd project, although there is no feasible way to do base64 decoding due to the special whitespace characters allowed by Node.js base64.

I'm not sure why base64 meets the criteria. It is also a newly added dependency, and base64's usage scope is similar to that of byte swap. Are there complaints that base64 has performance issues in real-world workloads?

Member

@tniessen tniessen left a comment


I have similar concerns. Adding new things to deps is a tough decision, and I don't expect swapX() to be used much, especially since these do not exist on Uint8Array.

inline static void swap_simd(char* data, size_t* nbytes, size_t size) {
  if (!support_simd())
    return;
Member


I might be missing something crucial here. How is this supposed to work? It seems like swap_simd may or may not actually swap data, and the caller seems unable to tell what happened.

Author


The logic is that, if the current platform does not support AVX-512, swap_simd returns directly, and inside util-inl.h swap16() falls through to the legacy swapping code.

#else  // only support X64
inline static void swap_simd(char* data, size_t* nbytes, size_t size) {
  return;
Member


Similarly here. In this case, swap_simd does not do anything, but in other cases, it might, without the caller being aware.

Author


If the current platform does not support AVX-512, swap_simd returns directly, and inside util-inl.h swap16() falls through to the legacy swapping code.

src/util-inl.h Outdated
@@ -216,6 +217,8 @@ inline v8::Local<v8::String> OneByteString(v8::Isolate* isolate,
void SwapBytes16(char* data, size_t nbytes) {
  CHECK_EQ(nbytes % 2, 0);

  ::swap_simd(data, &nbytes, 16);
Member


Again, I might be missing something. Why do these calls fall through to the old implementation? Should it not return here?

Author


swap_simd only processes n16 bytes in this case. When swap_simd returns, data and nbytes have been updated, and control flows into the legacy code below to process the remaining (nbytes - n16) bytes.

@bnoordhuis
Member

Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node

I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.

@lucshi
Author

lucshi commented Sep 10, 2022

I have similar concerns. Adding new things to deps is a tough decision, and I don't expect swapX() to be used much, especially since these do not exist on Uint8Array.

According to your comments, if I would like to contribute this optimization, should I add the simd_swap logic into util-inl.h directly instead of adding a new deps folder? Then there would be no concern about affecting the deps folder, and simd_swap is just a pre-processing step for buffers to be swapped.

@lucshi
Author

lucshi commented Sep 10, 2022

Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node

I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.

I agree that using real workloads is also a good way to improve Node.js performance. I'm not against that, and I'm working on real workloads like Ghost and microservices too. The guidance I follow in my performance work is that every micro-benchmark is equally important as long as it exists in the Node.js benchmark suite. I believe the purpose of the benchmarks maintained in the Node.js repo is to evaluate the performance of each module at a low level, instead of using big real-world workloads directly. Node.js is a programming platform, and users can invoke any API now and in future development. We cannot predict that an API like buffer.swap64() would be used less than other APIs without limiting users.

@bnoordhuis
Member

Don't attach too much weight to the benchmarks in the repo. Their primary function is to catch performance regressions. That they can be used to show performance improvements is just a nice secondary function.

@lucshi
Author

lucshi commented Sep 17, 2022

Updated this patch after taking all the considerations from the review comments into account:

  1. The patch is too big because it adds a deps folder -> I moved all the code changes into the util-inl.h file, and did not touch any other folders or files.
  2. The patch only adopted an AVX-512 SIMD instruction, which is not widespread -> I enabled an SSSE3 version and an AVX-512 version at the same time, so that all popular platforms can benefit from this patch. Most known platforms support SSSE3.
  3. Do not modify the gyp file -> I removed all code changes in the gyp file, and use attributes in the source code file like deps/zlib does.

For all platforms, the largest performance gain is >5x.

For the platforms supporting AVX-512, it achieved the best performance gains; comparison chart below:
[chart image]

For the platforms supporting SSE only, it also achieved good performance; chart below:
[chart image]

@lucshi lucshi closed this Sep 26, 2022
@lucshi lucshi deleted the simd-swap branch September 26, 2022 07:38
5 participants