
Simd swap #44578


Closed
wants to merge 25 commits into from

Conversation

lucshi

@lucshi lucshi commented Sep 9, 2022

Used SIMD to improve buffer swap performance for the following APIs:

buffer.swap16()
buffer.swap32()
buffer.swap64()

According to the Node.js buffer API design, only buffers larger than 128 bytes go into the C++-level implementation, so this optimization only applies in that case. The optimization takes effect on x86 platforms that support AVX-512 VBMI instructions, e.g. Cannon Lake and Ice Lake, including Ice Lake-based AWS EC2 instances.

The code is placed in the deps/ folder since it is platform dependent.

To achieve the best performance, it adopts an AVX-512 permute instruction and uses inlining instead of a function call, because the SIMD logic is very short.

The Node.js buffer swap benchmark shows large improvements after applying this optimization, varying from ~30% to ~600%.

node-benchmark-compare 
                                                                          confidence improvement accuracy (*)     (**)    (***)
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='false'        ***     70.56 %       ±3.32%   ±4.56%   ±6.22%
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='true'         ***     60.85 %       ±3.78%   ±5.19%   ±7.09%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='false'        ***    254.48 %       ±3.07%   ±4.40%   ±6.44%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='true'         ***    254.36 %       ±2.71%   ±3.88%   ±5.68%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='false'        ***    132.66 %       ±3.45%   ±4.95%   ±7.27%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='true'         ***    135.83 %       ±4.25%   ±6.09%   ±8.91%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='false'        ***     93.81 %       ±2.70%   ±3.71%   ±5.06%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='true'         ***     85.19 %       ±3.74%   ±5.26%   ±7.45%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='false'        ***    346.51 %       ±4.85%   ±6.90%   ±9.98%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='true'         ***    334.75 %       ±7.03%  ±10.10%  ±14.84%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='false'        ***    200.37 %       ±3.26%   ±4.68%   ±6.88%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='true'         ***    202.17 %       ±4.13%   ±5.94%   ±8.73%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='false'         ***     32.23 %       ±6.39%   ±9.04%  ±12.94%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='true'          ***     27.76 %       ±3.85%   ±5.31%   ±7.29%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='false'         ***     82.32 %       ±4.38%   ±6.08%   ±8.44%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='true'          ***     82.85 %      ±12.90%  ±18.01%  ±25.27%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='false'         ***     42.91 %       ±2.00%   ±2.74%   ±3.75%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='true'          ***     44.30 %       ±2.94%   ±4.03%   ±5.50%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='false'                  -0.01 %       ±0.35%   ±0.48%   ±0.66%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='true'                   -0.05 %       ±0.19%   ±0.26%   ±0.36%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='false'                   0.38 %       ±1.76%   ±2.42%   ±3.30%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='true'                    0.72 %       ±2.06%   ±2.82%   ±3.85%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='false'                   0.44 %       ±1.00%   ±1.37%   ±1.86%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='true'                   -0.36 %       ±0.48%   ±0.66%   ±0.90%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='false'         ***     62.04 %       ±5.14%   ±7.21%  ±10.20%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='true'          ***     50.11 %       ±4.66%   ±6.43%   ±8.86%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='false'         ***    185.22 %       ±5.40%   ±7.75%  ±11.38%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='true'          ***    186.12 %       ±2.80%   ±3.94%   ±5.59%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='false'         ***    107.24 %       ±2.48%   ±3.56%   ±5.24%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='true'          ***    109.04 %       ±5.19%   ±7.45%  ±10.96%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='false'        ***    104.20 %       ±2.08%   ±2.85%   ±3.89%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='true'         ***    176.63 %      ±58.59%  ±84.16% ±123.81%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='false'        ***    450.33 %       ±4.58%   ±6.57%   ±9.65%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='true'         ***    618.62 %     ±161.33% ±231.77% ±340.97%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='false'        ***    264.54 %       ±3.58%   ±5.15%   ±7.58%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='true'         ***    384.05 %     ±110.97% ±159.42% ±234.52%

lucshi added 11 commits September 4, 2022 09:07
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512
instructions.
@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/gyp

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Sep 9, 2022
Contributor

@mscdex mscdex left a comment


I'm -1 on this for two main reasons:

  1. It's limited to a single, new SIMD instruction that isn't widespread yet
  2. I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here

Additional notes:

  • It's missing a license and the changes in tools/ for including said license, if this is a true 3rd-party dependency
  • Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)

@lucshi
Author

lucshi commented Sep 9, 2022

I'm -1 on this for two main reasons:

  1. It's limited to a single, new SIMD instruction that isn't widespread yet
  2. I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here

Additional notes:

  • It's missing a license and the changes in tools/ for including said license, if this is a true 3rd-party dependency
  • Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)

The code is not from a 3rd party like base64 or zlib, so there is no 3rd-party license. This patch optimizes util.h itself to make byte swapping faster. Byte swapping is useful in endian conversion and networking scenarios.

For your first comment, I'm not sure if the Node.js project has criteria for platform-dependent code. Actually AVX-512 VBMI is an "old" instruction set, introduced four years ago, and is widespread nowadays, e.g. in AWS cloud instances. On the other hand, this is a performance optimization that would not degrade performance on platforms that do not support AVX-512 VBMI. If you'd advise supporting more SIMD instruction sets like AVX-512F or AVX2, I think it's OK to add them later.

@mscdex
Contributor

mscdex commented Sep 9, 2022

Byte swapping is useful in endian conversion

Right, but I don't see that as a common operation that would be a bottleneck in node applications.

@lucshi
Author

lucshi commented Sep 9, 2022

Byte swapping is useful in endian conversion

Right, but I don't see that as a common operation that would be a bottleneck in node applications.

Are there criteria for what a performance PR should be? In general I'm looking for parallel programming opportunities in every area to make Node.js run at its best performance.

@mscdex
Contributor

mscdex commented Sep 9, 2022

Are there criteria for what a performance PR should be? In general I'm looking for parallel programming opportunities in every area to make Node.js run at its best performance.

Here are a few general suggestions:

  • Look for any performance-related TODOs throughout the node codebase
  • Consider submitting performance improvements for node's existing dependencies -- not only will that potentially benefit node, but perhaps other users as well
  • Research common/popular use cases for node, break down the potential bottlenecks (e.g. network I/O, thread pool consumption, CPU usage, JS<->C++ boundary crossing, etc.) that can arise from those use cases, and see if any of them can be improved by changing the relevant parts in node

Offhand, if I had to pick a specific feature that would be worth optimizing, it would be supporting accelerated base64 encoding with the base64url alphabet, or accelerated base64 decoding. Arguably these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of those you may want to start there.

@lucshi
Author

lucshi commented Sep 9, 2022

Offhand, if I had to pick a specific feature that would be worth optimizing, it would be supporting accelerated base64 encoding with the base64url alphabet, or accelerated base64 decoding. Arguably these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of those you may want to start there.

I already submitted a PR to improve base64 encoding in the new 3rd-party base64-simd project, although there is no feasible way to do base64 decoding due to the special whitespace characters allowed by Node.js base64.

I'm not sure why base64 meets the criteria. It is also a newly added dependency, and base64's usage scope is similar to that of byte swap. Are there complaints that base64 has performance issues in real-world workloads?

Member

@tniessen tniessen left a comment


I have similar concerns. Adding new things to deps is a tough decision, and I don't expect swapX() to be used much, especially since these do not exist on Uint8Array.

inline static void swap_simd(char* data, size_t* nbytes, size_t size) {
  if (!support_simd())
    return;
Member


I might be missing something crucial here. How is this supposed to work? It seems like swap_simd may or may not actually swap data, and the caller seems unable to tell what happened.

Author


The logic is that, if the current platform does not support AVX-512, swap_simd returns directly, and inside util-inl.h swap16() falls through to the legacy swapping code.

#else  // only support X64
inline static void swap_simd(char* data, size_t* nbytes, size_t size) {
  return;
Member


Similarly here. In this case, swap_simd does not do anything, but in other cases, it might, without the caller being aware.

Author


If the current platform does not support AVX-512, swap_simd returns directly, and inside util-inl.h swap16() falls through to the legacy swapping code.

src/util-inl.h Outdated
@@ -216,6 +217,8 @@ inline v8::Local<v8::String> OneByteString(v8::Isolate* isolate,
void SwapBytes16(char* data, size_t nbytes) {
  CHECK_EQ(nbytes % 2, 0);

  ::swap_simd(data, &nbytes, 16);
Member


Again, I might be missing something. Why do these calls fall through to the old implementation? Should it not return here?

Author


swap_simd only processes n16 bytes in this case. When swap_simd returns, data and nbytes have been updated, and control flows into the legacy code below to process the remaining (nbytes - n16) bytes.

@bnoordhuis
Member

Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node

I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.

@lucshi
Author

lucshi commented Sep 10, 2022

I have similar concerns. Adding new things to deps is a tough decision, and I don't expect swapX() to be used much, especially since these do not exist on Uint8Array.

According to your comments, if I would like to contribute this optimization, should I add the simd_swap logic into util-inl.h directly instead of adding a new deps folder? Then there would be no concern about affecting the deps folder, and simd_swap is just a pre-processing step for buffers to be swapped.

@lucshi
Author

lucshi commented Sep 10, 2022

Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node

I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.

I agree that using real workloads is also a good way to improve Node.js performance. I'm not against that, and I'm working on real workloads like Ghost and microservices too. The guidance I follow in my performance work is that every micro-benchmark is equally important as long as it exists in the Node.js benchmark suite. I believe the purpose of the benchmarks maintained in the Node.js repo is to evaluate the performance of each module at a low level, instead of using big real-world workloads directly. Node.js is a programming platform, and users can invoke any API now and in future development. We cannot predict that an API like buffer.swap64() would be used less than other APIs without limiting users.

@bnoordhuis
Member

Don't attach too much weight to the benchmarks in the repo. Their primary function is to catch performance regressions. That they can be used to show performance improvements is just a nice secondary function.

@lucshi
Author

lucshi commented Sep 17, 2022

Updated this patch after taking all the considerations from the review comments into account:

  1. The patch is too big because it adds a deps folder -> I moved all the code changes into the util-inl.h file, and did not touch any other folders or files.
  2. The patch only adopted an AVX-512 SIMD instruction, which is not widespread -> I enabled an SSSE3 version and an AVX-512 version at the same time, so that all popular platforms can benefit from this patch. Most known platforms support SSSE3.
  3. Do not modify the gyp file -> I removed all code changes in the gyp file, and use attributes in the source code file like deps/zlib does.

For all platforms, the largest performance gain is >5x.

For the platforms supporting AVX-512, it achieved the best performance gains; comparison chart below:
[chart image]

For the platforms supporting SSE only, it also achieved good performance; chart below:
[chart image]

@lucshi lucshi closed this Sep 26, 2022
@lucshi lucshi deleted the simd-swap branch September 26, 2022 07:38
5 participants