Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string_decoder: rewrite implementation #6777

Merged
merged 1 commit into from
May 29, 2016

Conversation

mscdex
Copy link
Contributor

@mscdex mscdex commented May 15, 2016

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)
  • string_decoder
Description of change

This commit provides a rewrite of StringDecoder that both improves performance (for non-single-byte encodings) and understandability.

Additionally, StringDecoder instantiation performance has increased considerably due to inlinability and more efficient encoding name checking.

Benchmark results:

string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=16 n=2500000: ./node: 2608500 ./node-master: 2411100 ............. 8.19%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=64 n=2500000: ./node: 3812400 ./node-master: 3689900 ............. 3.32%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=256 n=2500000: ./node: 3995900 ./node-master: 3666700 ............ 8.98%
string_decoder/string-decoder.js encoding=utf8 inlen=32 chunk=1024 n=2500000: ./node: 3990400 ./node-master: 3734300 ........... 6.86%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=16 n=2500000: ./node: 657720 ./node-master: 593080 ............. 10.90%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=64 n=2500000: ./node: 1454800 ./node-master: 1356500 ............ 7.25%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=256 n=2500000: ./node: 1821500 ./node-master: 1741000 ........... 4.62%
string_decoder/string-decoder.js encoding=utf8 inlen=128 chunk=1024 n=2500000: ./node: 1819900 ./node-master: 1750200 .......... 3.98%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=16 n=2500000: ./node: 83475 ./node-master: 75012 .............. 11.28%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=64 n=2500000: ./node: 175280 ./node-master: 172150 ............. 1.82%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=256 n=2500000: ./node: 266000 ./node-master: 247300 ............ 7.56%
string_decoder/string-decoder.js encoding=utf8 inlen=1024 chunk=1024 n=2500000: ./node: 260120 ./node-master: 252230 ........... 3.13%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=16 n=2500000: ./node: 20576 ./node-master: 17684 .............. 16.35%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=64 n=2500000: ./node: 46176 ./node-master: 43044 ............... 7.27%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=256 n=2500000: ./node: 66595 ./node-master: 63134 .............. 5.48%
string_decoder/string-decoder.js encoding=utf8 inlen=4096 chunk=1024 n=2500000: ./node: 67692 ./node-master: 64320 ............. 5.24%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=16 n=2500000: ./node: 792560 ./node-master: 376330 ...... 110.60%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=64 n=2500000: ./node: 2776000 ./node-master: 1388800 ..... 99.88%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=256 n=2500000: ./node: 2552900 ./node-master: 1250700 ... 104.12%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=32 chunk=1024 n=2500000: ./node: 2791300 ./node-master: 1263100 .. 120.99%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=16 n=2500000: ./node: 242150 ./node-master: 100900 ..... 139.99%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=64 n=2500000: ./node: 731910 ./node-master: 308100 ..... 137.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=256 n=2500000: ./node: 1880800 ./node-master: 1005100 ... 87.14%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=128 chunk=1024 n=2500000: ./node: 2031900 ./node-master: 1067100 .. 90.41%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=16 n=2500000: ./node: 28221 ./node-master: 13268 ...... 112.70%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=64 n=2500000: ./node: 110520 ./node-master: 48485 ..... 127.95%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=256 n=2500000: ./node: 281630 ./node-master: 155010 .... 81.69%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=1024 chunk=1024 n=2500000: ./node: 509790 ./node-master: 329090 ... 54.91%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=16 n=2500000: ./node: 7454.5 ./node-master: 3540.4 .... 110.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=64 n=2500000: ./node: 24644 ./node-master: 12107 ...... 103.56%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=256 n=2500000: ./node: 74954 ./node-master: 40305 ...... 85.97%
string_decoder/string-decoder.js encoding=base64-utf8 inlen=4096 chunk=1024 n=2500000: ./node: 125180 ./node-master: 92216 .... 35.75%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=16 n=2500000: ./node: 985010 ./node-master: 451520 ..... 118.15%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=64 n=2500000: ./node: 2830500 ./node-master: 1421600 .... 99.10%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=256 n=2500000: ./node: 2853400 ./node-master: 1380000 .. 106.77%
string_decoder/string-decoder.js encoding=base64-ascii inlen=32 chunk=1024 n=2500000: ./node: 2389400 ./node-master: 1483300 .. 61.08%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=16 n=2500000: ./node: 268370 ./node-master: 126350 .... 112.40%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=64 n=2500000: ./node: 756360 ./node-master: 448280 ..... 68.73%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=256 n=2500000: ./node: 2099500 ./node-master: 1075000 .. 95.29%
string_decoder/string-decoder.js encoding=base64-ascii inlen=128 chunk=1024 n=2500000: ./node: 1978000 ./node-master: 1120100 . 76.59%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=16 n=2500000: ./node: 30721 ./node-master: 14321 ..... 114.52%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=64 n=2500000: ./node: 132060 ./node-master: 64124 .... 105.94%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=256 n=2500000: ./node: 324520 ./node-master: 185200 ... 75.23%
string_decoder/string-decoder.js encoding=base64-ascii inlen=1024 chunk=1024 n=2500000: ./node: 581660 ./node-master: 425450 .. 36.71%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=16 n=2500000: ./node: 9096.6 ./node-master: 3610.9 ... 151.92%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=64 n=2500000: ./node: 30843 ./node-master: 14610 ..... 111.11%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=256 n=2500000: ./node: 90907 ./node-master: 46596 ..... 95.10%
string_decoder/string-decoder.js encoding=base64-ascii inlen=4096 chunk=1024 n=2500000: ./node: 150890 ./node-master: 98642 ... 52.97%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=16 n=2500000: ./node: 3743100 ./node-master: 2683600 ......... 39.48%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=64 n=2500000: ./node: 6516300 ./node-master: 4930200 ......... 32.17%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=256 n=2500000: ./node: 6235800 ./node-master: 5263600 ........ 18.47%
string_decoder/string-decoder.js encoding=utf16le inlen=32 chunk=1024 n=2500000: ./node: 6349900 ./node-master: 4837100 ....... 31.28%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=16 n=2500000: ./node: 973190 ./node-master: 777490 .......... 25.17%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=64 n=2500000: ./node: 3124800 ./node-master: 2180500 ........ 43.31%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=256 n=2500000: ./node: 4903100 ./node-master: 3918900 ....... 25.11%
string_decoder/string-decoder.js encoding=utf16le inlen=128 chunk=1024 n=2500000: ./node: 4909800 ./node-master: 3802800 ...... 29.11%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=16 n=2500000: ./node: 131990 ./node-master: 95528 .......... 38.17%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=64 n=2500000: ./node: 401520 ./node-master: 323660 ......... 24.05%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=256 n=2500000: ./node: 964750 ./node-master: 824650 ........ 16.99%
string_decoder/string-decoder.js encoding=utf16le inlen=1024 chunk=1024 n=2500000: ./node: 1424400 ./node-master: 1378400 ...... 3.33%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=16 n=2500000: ./node: 33139 ./node-master: 24806 ........... 33.59%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=64 n=2500000: ./node: 101440 ./node-master: 77741 .......... 30.48%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=256 n=2500000: ./node: 242900 ./node-master: 203300 ........ 19.48%
string_decoder/string-decoder.js encoding=utf16le inlen=4096 chunk=1024 n=2500000: ./node: 358530 ./node-master: 336270 ........ 6.62%


string_decoder/string-decoder-create.js encoding=ascii n=25000000: ./node: 17123000 ./node-master: 7115100 .. 140.66%
string_decoder/string-decoder-create.js encoding=utf8 n=25000000: ./node: 3912500 ./node-master: 2699900 ..... 44.91%
string_decoder/string-decoder-create.js encoding=utf-8 n=25000000: ./node: 3843400 ./node-master: 2118000 .... 81.46%
string_decoder/string-decoder-create.js encoding=base64 n=25000000: ./node: 3112700 ./node-master: 2427900 ... 28.21%
string_decoder/string-decoder-create.js encoding=ucs2 n=25000000: ./node: 3626400 ./node-master: 2487700 ..... 45.77%
string_decoder/string-decoder-create.js encoding=UTF-8 n=25000000: ./node: 2867000 ./node-master: 1915200 .... 49.70%
string_decoder/string-decoder-create.js encoding=AscII n=25000000: ./node: 7559200 ./node-master: 4634900 .... 63.09%
string_decoder/string-decoder-create.js encoding=UTF-16LE n=25000000: ./node: 2690700 ./node-master: 1747700 . 53.96%

@nodejs-github-bot nodejs-github-bot added the buffer Issues and PRs related to the buffer subsystem. label May 15, 2016
@mscdex mscdex added string_decoder Issues and PRs related to the string_decoder subsystem. and removed buffer Issues and PRs related to the buffer subsystem. labels May 15, 2016
throw new Error('Unknown encoding: ' + encoding);
// Do not cache `Buffer.isEncoding` when checking encoding names as some
// modules monkey-patch it to support additional encodings
function normalizeEncoding(enc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking it would make sense to move this into an internal util

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. I was working on some Buffer changes that might end up doing something similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I had been experimenting with something similar in the speculative icu module work. See https://github.com/jasnell/node/blob/icu-module/lib/icu.js#L113-L134.

@jasnell
Copy link
Member

jasnell commented May 15, 2016

Overall there's nothing that stands out as being a concern here. I would note that the documentation for string_decoder leaves much to be desired but that can be fixed up later. If CI is green then LGTM

@mscdex
Copy link
Contributor Author

mscdex commented May 15, 2016

// characters.
exports.StringDecoder = StringDecoder;
function StringDecoder(encoding) {
this.encoding = normalizeEncoding(encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should default to utf8 when encoding is empty/undefined/generally false-y, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed + test added.

@addaleax
Copy link
Member

CI is green, and LGTM if the fallback to 'utf8' is restored as stated in the docs.

@mscdex
Copy link
Contributor Author

mscdex commented May 16, 2016

CI after a few nits were addressed: https://ci.nodejs.org/job/node-test-pull-request/2652/

@mscdex
Copy link
Contributor Author

mscdex commented May 16, 2016

I had to change the UTF-8 implementation due to being unable to distinguish between explicit and implicit replacement characters. The new UTF-8 implementation now does incomplete character checking similar to the original implementation. UTF-8 performance numbers are still the same however.

@addaleax
Copy link
Member

CI with the most recent changes: https://ci.nodejs.org/job/node-test-commit/3352/
LGTM if it’s green, nice work!

@mscdex
Copy link
Contributor Author

mscdex commented May 18, 2016

@jasnell Does this still LGTY (after the changes I made to the UTF-8 implementation)?

@jasnell
Copy link
Member

jasnell commented May 18, 2016

Yes. I'd like to see if we can make further improvements and optimizations to this but this LGTM

@mscdex
Copy link
Contributor Author

mscdex commented May 18, 2016

@jasnell I'm not sure much more is possible for UTF-8 on the js-side of things. AFAIK the only way now to optimize UTF-8 further would be to try to skip v8's UTF-8 decoder and use a custom C/C++ UTF-8 to UTF-16 decoder and pass the resulting output to v8? I'm assuming v8 doesn't do any additional processing on two-byte strings (created via String::NewFromTwoByte).

@mscdex
Copy link
Contributor Author

mscdex commented May 18, 2016

I'm being extra cautious and marking this as not for v4.x. If it's one thing I've learned, it's that node's tests aren't always up to snuff and may not (always) be good enough for checking rewrites like this. Let's let it stew in current branches first.

@jasnell
Copy link
Member

jasnell commented May 18, 2016

ICU includes a number of highly optimized UTF8/UTF16 utilities that could potentially be leveraged. In fact, we might be able to drop down to a full native implementation of string_decoder that is based entirely on ICU's utilities.

i = 0;
}
if (i < buf.length)
return (r ? r + this.text(buf, i) : this.text(buf, i));
Copy link
Member

@bnoordhuis bnoordhuis May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Removed, misread.

@bnoordhuis
Copy link
Member

LGTM

This commit provides a rewrite of StringDecoder that both improves
performance (for non-single-byte encodings) and understandability.

Additionally, StringDecoder instantiation performance has increased
considerably due to inlinability and more efficient encoding name
checking.

PR-URL: nodejs#6777
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@mscdex
Copy link
Contributor Author

mscdex commented May 29, 2016

One last CI before landing: https://ci.nodejs.org/job/node-test-pull-request/2855/

@mscdex mscdex merged commit d23b7d2 into nodejs:master May 29, 2016
@mscdex mscdex deleted the stringdecoder-rewrite branch May 29, 2016 19:33
Fishrock123 pushed a commit to Fishrock123/node that referenced this pull request May 30, 2016
This commit provides a rewrite of StringDecoder that both improves
performance (for non-single-byte encodings) and understandability.

Additionally, StringDecoder instantiation performance has increased
considerably due to inlinability and more efficient encoding name
checking.

PR-URL: nodejs#6777
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
rvagg pushed a commit that referenced this pull request Jun 2, 2016
This commit provides a rewrite of StringDecoder that both improves
performance (for non-single-byte encodings) and understandability.

Additionally, StringDecoder instantiation performance has increased
considerably due to inlinability and more efficient encoding name
checking.

PR-URL: #6777
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@gagern
Copy link
Contributor

gagern commented Jun 15, 2016

See #7308 for some breakage apparently caused by this change here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants