Simd swap #44578

Conversation
Enabled SIMD for buffer.swap16|32|64. Currently using the fastest AVX512 instructions.
I'm -1 on this for two main reasons:
- It's limited to a single, new SIMD instruction that isn't widespread yet
- I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here
Additional notes:
- It's missing a license and the changes in tools/ for including said license if this is a true 3rd party dependency
- Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)
The code is not from a 3rd party like Base64 or zlib, so there is no 3rd-party license. This patch optimizes util.h itself to make byte swapping faster. Byte swapping is useful for endian conversion and in networking scenarios. On your first point, I'm not sure whether the Node.js project has criteria for platform-dependent code. Actually, AVX512-VBMI is an "old" instruction set, introduced four years ago, and is widespread nowadays, e.g. in AWS cloud instances. On the other hand, this is a performance optimization that would not degrade performance on platforms not supporting AVX512-VBMI. If you'd advise supporting more SIMD instruction sets like AVX512F or AVX2, I think it's OK to add them later.
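For context on the operation being discussed: buffer.swap16() swaps each adjacent pair of bytes in place, which converts an array of big-endian uint16 values to little-endian (or back). A minimal scalar version, illustrative only, looks like this:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative scalar 16-bit byte swap: the operation buffer.swap16()
// performs. Swaps each adjacent byte pair in place, flipping the
// endianness of every 16-bit element in the buffer.
static void swap_bytes_16(unsigned char* data, size_t nbytes) {
  for (size_t i = 0; i + 1 < nbytes; i += 2) {
    unsigned char t = data[i];
    data[i] = data[i + 1];
    data[i + 1] = t;
  }
}
```

The PR's point is that this per-pair loop is what SIMD can replace with wide permutes for large buffers.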
Right, but I don't see that as a common operation that would be a bottleneck in node applications.
Are there criteria for what a performance PR should be? In general, I'm looking for parallel-programming opportunities in any area to make Node.js run at its best performance.
Here are a few general suggestions:
Offhand, if I had to pick a specific feature that would be worth optimizing, it would be supporting accelerated base64 encoding with the base64url alphabet, or base64 decoding. Arguably, these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of those, you may want to start there.
I already submitted a PR to improve Base64 encoding in the new 3rd-party base64-simd project, although there is no feasible way to do Base64 decoding because of the special empty characters allowed by Node.js base64. I'm not sure why base64 meets the criteria: it is also a newly added dependency, and base64's usage scope is similar to byte swap's. Are there complaints that base64 has performance issues in real-world workloads?
I have similar concerns. Adding new things to deps is a tough decision, and I don't expect swapX() to be used much, especially since these do not exist on Uint8Array.
inline static void
swap_simd (char* data, size_t* nbytes, size_t size) {
  if (!support_simd())
    return;
I might be missing something crucial here. How is this supposed to work? It seems like swap_simd may or may not actually swap data, and the caller seems unable to tell what happened.
The logic is that, if the current platform does not support AVX512, swap_simd returns immediately, and inside util-inl.h, swap16() falls back to the legacy swapping.
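That contract can be sketched as follows (illustrative only, not the PR's actual code; `support_simd` is a stand-in for the PR's runtime CPU-feature check, stubbed here to return false, so the legacy scalar loop does all the work):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the PR's runtime CPU-feature check (assumption: always false here).
static bool support_simd() { return false; }

// Mirrors the described contract: if the platform lacks AVX512, return
// without touching the buffer, so the caller's legacy loop handles everything.
static void swap_simd(char* data, size_t* nbytes, size_t size) {
  if (!support_simd())
    return;
  // ... the AVX512 fast path would go here ...
  (void)data; (void)nbytes; (void)size;
}

// Caller in the style of util-inl.h's SwapBytes16().
static void SwapBytes16(char* data, size_t nbytes) {
  swap_simd(data, &nbytes, 16);  // may shrink nbytes; a no-op without AVX512
  for (size_t i = 0; i + 1 < nbytes; i += 2) {  // legacy scalar fallback
    char t = data[i];
    data[i] = data[i + 1];
    data[i + 1] = t;
  }
}
```

This is what the reviewer is questioning: the caller cannot distinguish "swap_simd did nothing" from "swap_simd already swapped part of the data" except through the updated nbytes.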
#else  // only support X64
inline static void
swap_simd (char* data, size_t* nbytes, size_t size) {
  return;
Similarly here. In this case, swap_simd does not do anything, but in other cases, it might, without the caller being aware.
If the current platform does not support AVX512, swap_simd returns directly, and inside util-inl.h, swap16() falls back to the legacy swapping.
src/util-inl.h
Outdated
@@ -216,6 +217,8 @@ inline v8::Local<v8::String> OneByteString(v8::Isolate* isolate,
void SwapBytes16(char* data, size_t nbytes) {
  CHECK_EQ(nbytes % 2, 0);

  ::swap_simd(data, &nbytes, 16);
Again, I might be missing something. Why do these calls fall through to the old implementation? Should it not return here?
swap_simd only processes n16 bytes in this case. When swap_simd returns, data and nbytes have been updated, and the flow continues into the legacy code below to process the remaining (nbytes - n16) bytes.
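The described prefix/remainder split can be illustrated like this (a sketch, not the PR's code; the pointer-to-pointer signature is an assumption made here so the update to data is visible to the caller, and a plain scalar loop stands in for the vector pass):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: a pass that swaps byte pairs over the largest multiple-of-16
// prefix (n16 bytes), then advances *data and shrinks *nbytes so the
// caller's legacy loop only sees the remaining (nbytes - n16) tail.
static void swap_simd_prefix(char** data, size_t* nbytes) {
  size_t n16 = *nbytes - (*nbytes % 16);  // bytes handled by the "fast" pass
  for (size_t i = 0; i < n16; i += 2) {   // stand-in for the vector loop
    char t = (*data)[i];
    (*data)[i] = (*data)[i + 1];
    (*data)[i + 1] = t;
  }
  *data += n16;
  *nbytes -= n16;
}

static void SwapBytes16(char* data, size_t nbytes) {
  swap_simd_prefix(&data, &nbytes);
  for (size_t i = 0; i + 1 < nbytes; i += 2) {  // legacy code: remaining bytes
    char t = data[i];
    data[i] = data[i + 1];
    data[i + 1] = t;
  }
}
```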
I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc., and following where the numbers lead me.
Following your comments, if I'd like to contribute this optimization, should I add the simd_swap logic into util-inl.h directly instead of adding a new deps folder? Then there would be no concern about affecting the deps folder, and simd_swap is just a pre-processor for buffers to be swapped.
I agree that using real workloads is also a good way to improve Node.js performance. I'm not against that, and I'm working on real workloads like Ghost and microservices too. The guidance I followed in my performance work is that every micro-benchmark is equally important as long as it exists in the Node.js benchmark suite. I believe the purpose of the benchmarks maintained in the Node.js repo is to evaluate the performance of each module at a low level, instead of using big real-world workloads directly. Node.js is a programming platform, and users can invoke every API today and in future development. We cannot predict that some API like buffer.swap64() would be used less than other APIs without limiting users.
Don't attach too much weight to the benchmarks in the repo. Their primary function is to catch performance regressions. That they can be used to show performance improvements is just a nice secondary function. |
Used SIMD to improve buffer-swap performance, including the APIs below:
According to the Node.js buffer API design, only buffers over 128 bytes go into the C++-level implementations; this optimization only works for that case. The optimization takes effect on x86 platforms that support AVX512-VBMI instructions, e.g. Cannon Lake, Ice Lake, and Ice Lake-based AWS EC2 instances.
The code is placed in the deps/ folder as it is platform-dependent.
To achieve the best performance, it adopts the AVX512 permute instruction and uses inlining instead of a function call, because the SIMD logic is very short.
The Node.js buffer-swap benchmarks show big improvements after applying this optimization, varying from ~30% to ~600%.
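A hedged sketch of the permute-based approach described above (not the PR's actual code): AVX512-VBMI's `_mm512_permutexvar_epi8` can reorder all 64 bytes of a ZMM register in one instruction, so a 16-bit byte swap becomes one permute per 64-byte chunk, with a scalar loop handling the tail and non-AVX512 builds:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__AVX512VBMI__)
#include <immintrin.h>
#endif

// Sketch of a permute-based 16-bit byte swap. The scalar loop serves as
// both the tail handler and the full path when AVX512-VBMI is unavailable.
static void swap16_sketch(char* data, size_t nbytes) {
#if defined(__AVX512VBMI__)
  // Index vector that exchanges each even/odd byte pair within 64 bytes:
  // 1,0,3,2,5,4,... fed to the byte permute.
  alignas(64) uint8_t idx[64];
  for (int i = 0; i < 64; i += 2) { idx[i] = (uint8_t)(i + 1); idx[i + 1] = (uint8_t)i; }
  const __m512i perm = _mm512_load_si512(idx);
  while (nbytes >= 64) {
    __m512i v = _mm512_loadu_si512(data);
    _mm512_storeu_si512(data, _mm512_permutexvar_epi8(perm, v));
    data += 64;
    nbytes -= 64;
  }
#endif
  for (size_t i = 0; i + 1 < nbytes; i += 2) {  // scalar tail / fallback
    char t = data[i];
    data[i] = data[i + 1];
    data[i + 1] = t;
  }
}
```

Built without `-mavx512vbmi`, only the scalar loop compiles, which matches the PR's claim that the change cannot degrade performance on platforms lacking the instruction set.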