Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: optimize buffer transcode for matching encodings #54507

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

JonasBa
Copy link
Contributor

@JonasBa JonasBa commented Aug 22, 2024

If the buffer encodings match, we can just perform a copy of the buffer

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. needs-ci PRs that need a full CI run. labels Aug 22, 2024
@anonrig
Copy link
Member

anonrig commented Aug 22, 2024

Benchmark ci: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1614/

Results:

                                                                                           confidence improvement accuracy (*)   (**)  (***)
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='ascii'             ***     48.80 %       ±1.04% ±1.39% ±1.82%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='latin1'                    -0.00 %       ±0.58% ±0.78% ±1.02%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='ucs2'                       0.77 %       ±1.03% ±1.37% ±1.79%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='utf8'                       0.17 %       ±0.66% ±0.88% ±1.14%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='ascii'             **      0.96 %       ±0.69% ±0.91% ±1.19%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='latin1'           ***     50.83 %       ±0.90% ±1.21% ±1.58%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='ucs2'                      0.40 %       ±0.72% ±0.96% ±1.25%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='utf8'              **      0.92 %       ±0.65% ±0.86% ±1.12%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='ascii'                      -0.49 %       ±0.73% ±0.97% ±1.26%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='latin1'                      0.06 %       ±0.93% ±1.23% ±1.60%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='ucs2'               ***     52.05 %       ±0.87% ±1.17% ±1.53%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='utf8'                 *      0.78 %       ±0.72% ±0.97% ±1.26%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='utf8' fromEncoding='ascii'                       0.33 %       ±0.59% ±0.79% ±1.02%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='utf8' fromEncoding='latin1'              **     -0.99 %       ±0.71% ±0.95% ±1.23%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='utf8' fromEncoding='utf8'               ***     20.36 %       ±0.89% ±1.18% ±1.53%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='ascii'            ***     51.07 %       ±0.93% ±1.24% ±1.63%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='latin1'             *      0.65 %       ±0.62% ±0.82% ±1.07%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='ucs2'             ***      1.34 %       ±0.60% ±0.80% ±1.04%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='utf8'              **      0.95 %       ±0.62% ±0.82% ±1.07%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='ascii'           ***      1.31 %       ±0.67% ±0.89% ±1.17%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='latin1'          ***     52.41 %       ±1.09% ±1.46% ±1.90%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='ucs2'            ***      1.75 %       ±0.52% ±0.69% ±0.90%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='utf8'                     0.25 %       ±0.88% ±1.17% ±1.52%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='ascii'                      0.40 %       ±0.80% ±1.06% ±1.39%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='latin1'                     0.17 %       ±0.78% ±1.04% ±1.36%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='ucs2'              ***     50.24 %       ±1.00% ±1.33% ±1.73%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='utf8'               **     -1.02 %       ±0.70% ±0.94% ±1.22%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='utf8' fromEncoding='ascii'               *      0.79 %       ±0.77% ±1.03% ±1.34%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='utf8' fromEncoding='latin1'                    -0.14 %       ±0.57% ±0.76% ±0.99%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='utf8' fromEncoding='utf8'              ***     22.81 %       ±0.91% ±1.21% ±1.58%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='ascii'          ***     81.85 %       ±1.32% ±1.77% ±2.34%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='latin1'                  0.16 %       ±0.57% ±0.76% ±0.99%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='ucs2'           ***      0.41 %       ±0.08% ±0.10% ±0.13%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='utf8'                    0.38 %       ±0.58% ±0.77% ±1.01%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='ascii'         ***      1.61 %       ±0.49% ±0.65% ±0.84%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='latin1'        ***     59.96 %       ±0.86% ±1.14% ±1.50%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='ucs2'          ***      0.34 %       ±0.10% ±0.14% ±0.18%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='utf8'            *      0.43 %       ±0.36% ±0.48% ±0.63%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='ascii'                   -0.16 %       ±0.53% ±0.70% ±0.91%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='latin1'            *     -0.70 %       ±0.55% ±0.74% ±0.96%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='ucs2'            ***    157.60 %       ±1.76% ±2.37% ±3.15%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='utf8'                    -0.02 %       ±0.50% ±0.66% ±0.87%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='utf8' fromEncoding='ascii'           ***     -0.66 %       ±0.34% ±0.46% ±0.60%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='utf8' fromEncoding='latin1'                   0.10 %       ±0.78% ±1.03% ±1.34%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='utf8' fromEncoding='utf8'            ***     96.34 %       ±1.28% ±1.71% ±2.26%

@JonasBa JonasBa force-pushed the jb/buffer/transcode branch from 2c6ebcb to 9af9206 Compare August 22, 2024 19:29
@anonrig anonrig added author ready PRs that have at least one approval, no pending requests for changes, and a CI started. request-ci Add this label to start a Jenkins CI on a PR. labels Aug 22, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Aug 22, 2024
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig requested a review from lemire August 22, 2024 20:09
@nodejs-github-bot

This comment was marked as outdated.

@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking change:

$ node -p "buffer.transcode(Buffer.from([0x80]), 'utf8', 'utf8')" 
<Buffer ef bf bd>
$ ./node -p "buffer.transcode(Buffer.from([0x80]), 'utf8', 'utf8')" 
<Buffer 80>

if this is intentional, it needs to come with a doc update and tests. That being said, it's probably the most consistent/least unexpected to keep the existing behavior – in that case, this optimization can only be applied if the input is known to be valid.

@addaleax addaleax removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Aug 24, 2024
@anonrig
Copy link
Member

anonrig commented Aug 24, 2024

@addaleax So, if the input is invalid, it replaces invalid characters with replacement characters, even if from and to encodings are same. Is there an encoding type (or plural) that doesn't conform to this expectation?

@addaleax
Copy link
Member

@anonrig I'd be surprised if there were (you may just want to check manually to be sure), but obviously the replacement character itself differs between encodings. But yeah, the point is that buffer.transcode(src, from, to) should always return a buffer that is valid in the to encoding.

@anonrig
Copy link
Member

anonrig commented Aug 24, 2024

@addaleax @JonasBa so, technically we can check for utf16 and utf8 validity through simdutf and just return a copy if it's valid and still speed up utf8 and utf16 transcode.

@lemire
Copy link
Member

lemire commented Aug 25, 2024

@anonrig We shall add transcoding with replacement to simdutf in the future.

Copy link
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed out already, a check is missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants