-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
string_decoder: rewrite implementation #6777
Conversation
throw new Error('Unknown encoding: ' + encoding); | ||
// Do not cache `Buffer.isEncoding` when checking encoding names as some | ||
// modules monkey-patch it to support additional encodings | ||
function normalizeEncoding(enc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking it would make sense to move this into an internal util
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps. I was working on some Buffer changes that might end up doing something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, I had been experimenting with something similar in the speculative icu module work. See https://github.com/jasnell/node/blob/icu-module/lib/icu.js#L113-L134.
Overall there's nothing that stands out as being a concern here. I would note that the documentation for string_decoder leaves much to be desired but that can be fixed up later. If CI is green then LGTM |
// characters. | ||
exports.StringDecoder = StringDecoder; | ||
function StringDecoder(encoding) { | ||
this.encoding = normalizeEncoding(encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should default to utf8
when encoding
is empty/undefined/generally false-y, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed + test added.
CI is green, and LGTM if the fallback to |
e034539
to
21cc740
Compare
CI after a few nits were addressed: https://ci.nodejs.org/job/node-test-pull-request/2652/ |
21cc740
to
efd0a38
Compare
I had to change the UTF-8 implementation due to being unable to distinguish between explicit and implicit replacement characters. The new UTF-8 implementation now does incomplete character checking similar to the original implementation. UTF-8 performance numbers are still the same however. |
CI with the most recent changes: https://ci.nodejs.org/job/node-test-commit/3352/ |
@jasnell Does this still LGTY (after the changes I made to the UTF-8 implementation)? |
Yes. I'd like to see if we can make further improvements and optimizations to this but this LGTM |
@jasnell I'm not sure much more is possible for UTF-8 on the js-side of things. AFAIK the only way now to optimize UTF-8 further would be to try to skip v8's UTF-8 decoder and use a custom C/C++ UTF-8 to UTF-16 decoder and pass the resulting output to v8? I'm assuming v8 doesn't do any additional processing on two-byte strings (created via |
I'm being extra cautious and marking this as not for v4.x. If it's one thing I've learned, it's that node's tests aren't always up to snuff and may not (always) be good enough for checking rewrites like this. Let's let it stew in current branches first. |
ICU includes a number of highly optimized UTF8/UTF16 utilities that could potentially be leveraged. In fact, we might be able to drop down to a full native implementation of string_decoder that is based entirely on ICU's utilities. |
i = 0; | ||
} | ||
if (i < buf.length) | ||
return (r ? r + this.text(buf, i) : this.text(buf, i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: Removed, misread.
LGTM |
This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: nodejs#6777 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
efd0a38
to
d23b7d2
Compare
One last CI before landing: https://ci.nodejs.org/job/node-test-pull-request/2855/ |
This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: nodejs#6777 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability. Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking. PR-URL: #6777 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
See #7308 for some breakage apparently caused by this change here. |
Checklist
Affected core subsystem(s)
Description of change
This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability.
Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking.
Benchmark results: