stream: support decoding buffers for Writables #7425
CI: https://ci.nodejs.org/job/node-test-commit/3854/
/cc @mscdex @nodejs/streams
lib/_stream_writable.js
@@ -11,6 +11,7 @@ const util = require('util');
const internalUtil = require('internal/util');
const Stream = require('stream');
const Buffer = require('buffer').Buffer;
const StringDecoder = require('string_decoder').StringDecoder;
You might lazy-load this as is done in Readable?
Can you please check if this causes any performance issue? Please fix the nits @mscdex pointed out first.
Addressed the above comments (both function bodies should be below 600 characters now) & added a benchmark. Output from comparing with master using #7094 shows no significant difference for the existing functionality (no automatic decoding), but feel invited to suggest other/better ways to benchmark:
If the stream is in objectMode, does that just override the decodeBuffer option?
Can you please check the "old" net & http benchmarks?
The only reason I ask is because if it does, it looks like we would still create the string decoder anyway, even though it would not be used. Please feel free to correct me if I am wrong.
I'm 👎 on adding more features to streams that can be solved in userland. It's already complicated as it is. You can do this now using a string decoder in a pipeline or your write method impl.
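For readers who want to see what the userland approach mentioned above might look like, here is a minimal sketch. The class name `StringWritable` and its `received` array are hypothetical and only serve the illustration; the sketch assumes UTF-8 input and the default `decodeStrings` behaviour, so `_write()` always receives Buffers.

```javascript
'use strict';
const { Writable } = require('stream');
const { StringDecoder } = require('string_decoder');

// Hypothetical userland class; not part of Node.js core.
class StringWritable extends Writable {
  constructor(options) {
    super(options);
    // Assumption: all input is UTF-8.
    this._decoder = new StringDecoder('utf8');
    this.received = [];
  }

  _write(chunk, encoding, callback) {
    // With the default decodeStrings setting, chunk is a Buffer here,
    // even when write() was called with a string.
    const text = this._decoder.write(chunk);
    if (text.length > 0) this.received.push(text);
    callback();
  }

  _final(callback) {
    // Flush whatever the decoder still holds when the stream ends.
    const rest = this._decoder.end();
    if (rest.length > 0) this.received.push(rest);
    callback();
  }
}

const sink = new StringWritable();
const euro = Buffer.from('€'); // three bytes in UTF-8
sink.write(euro.subarray(0, 1)); // partial character, nothing emitted yet
sink.write(euro.subarray(1)); // completes the character
sink.write('abc');
console.log(sink.received);
```

The decoder holds back the incomplete first byte and only emits the character once the remaining bytes arrive, which is the behaviour the PR proposes to build into `Writable` itself.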
@mafintosh The same could be said about string decoder in Readable streams ;-)
@mscdex if anything we should remove that from readable streams / deprecate it as well hehe
benchmark/streams/writable-simple.js
'use strict';

const common = require('../common');
const v8 = require('v8');
Unused
@sonewman Yes, the behaviour for object mode is unaltered (no encoding/decoding in any case). If you feel that it is important, I can add a check for object mode. @mcollina Running the benchmarks now, this might take a while.
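As a small illustration of the existing object-mode behaviour referred to above (no encoding or decoding of chunks), the following sketch records what `write()` hands to the stream implementation. The names `seen` and `objSink` are made up for this example.

```javascript
'use strict';
const { Writable } = require('stream');

const seen = [];

// In object mode, chunks reach write() exactly as they were passed in.
const objSink = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) {
    seen.push({ type: typeof chunk, isBuffer: Buffer.isBuffer(chunk) });
    callback();
  }
});

objSink.write('text'); // stays a string
objSink.write(Buffer.from('bytes')); // stays a Buffer
objSink.write({ any: 'value' }); // arbitrary objects are allowed
console.log(seen);
```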
Not sure what you mean. And yes, of course this can be done in the
@addaleax the general sentiment is that there are already too many features in streams.
@mafintosh Pragmatically, I don't see us removing any features from streams, and this brings parity between
@addaleax any news on those benchmarks?
If the outcome here is a decision based on a general “no new features in streams” rule (and I would totally understand that), could we formalize that by setting the stability index of streams to
@mcollina Sorry, had to restart the HTTP benchmarks (and running each one 30× in a row takes a lot of time). Full output of the analysis for the
I’ll look into the
Please do, +/- 1% is definitely in the range of tolerance.
f36890b to 66f1091
Btw, here is the benchmark comparison output now that #7094 has landed (raw data). It’s quite the mix of positive and negative changes; I wouldn’t say there’s a noticeable overall impact of this PR. Again, there are some obviously unrelated but significant results (e.g. in
@addaleax there are a couple of benchmarks on
I’ll run those again with more iterations to get a clearer picture there (btw, the full 30× run of all HTTP & net benchmarks took over 1½ days… I’m not doing that again :D)
@addaleax OMG, that is some bad luck. There are 270 benchmarks, so let's just consider
This can of course happen, but I think we should assume it isn't just an accident. Looking at the histogram and kernel estimated distribution, it is clear that the values aren't normally distributed. The t-test assumes a normal distribution; however, the error is often small when considering 30 or more samples (central limit theorem). I'm not sure what causes this behaviour (maybe external changes). Come to think of it, I'm not sure the measurements are even theoretically normally distributed: each is essentially an inverse mean, which by the central limit theorem should make them inverse-normally distributed, not normally distributed. (A fix could be to measure in sec/op; I need to try it out.) However, this still doesn't explain the group (sum of two normals) distribution, which I think can only be caused by external changes (maybe we should just randomize the order).
Yes. The benchmarks are "calibrated" for eyeballing. Now that we have the statistical tools, the number of iterations in each benchmark needs to be reduced.
Okay,
About what one would expect if there’s no actual difference.
Took a closer look at some of the benchmarks:
From both the Shapiro-Wilk test ( It is very hard to eyeball, but I would say that something happened around Just for fun I tried running almost (decreased
which gives:
This is much better and does not show significance, so my best guess is that there was too much external input.
edit 2: while it would theoretically be better to measure in sec/obs, it doesn't matter from a statistical perspective when the obs/sec mean is sufficiently high and its variance is sufficiently low. Only a normal distribution that is somewhat centered around 0 (not the case here) becomes bimodal under the reciprocal transformation; otherwise it remains approximately normally distributed.
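The point about the reciprocal transformation can be checked with a rough simulation. The sketch below is not the tooling used in this thread; the seeded generator, the sample size, and the numbers (1000 ops/sec with a standard deviation of 10) are all assumptions chosen for illustration. It compares the sample skewness of simulated ops/sec values with that of their reciprocals (sec/op); values near zero indicate a roughly symmetric distribution.

```javascript
'use strict';

// Small seeded PRNG (mulberry32-style) so that runs are reproducible.
function makeRng(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform: one normal sample with the given mean and sd.
function normalSample(rng, mean, sd) {
  const u1 = 1 - rng(); // in (0, 1], avoids Math.log(0)
  const u2 = rng();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + sd * z;
}

// Sample skewness: close to 0 for a symmetric distribution.
function skewness(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const m2 = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  const m3 = values.reduce((sum, v) => sum + (v - mean) ** 3, 0) / n;
  return m3 / Math.pow(m2, 1.5);
}

const rng = makeRng(1);
// Hypothetical benchmark: mean 1000 ops/sec, standard deviation 10.
const opsPerSec = Array.from({ length: 10000 },
                             () => normalSample(rng, 1000, 10));
const secPerOp = opsPerSec.map((x) => 1 / x);

const skewOps = skewness(opsPerSec);
const skewSec = skewness(secPerOp);
console.log({ skewOps, skewSec });
```

With a mean this far from zero relative to the spread, both skewness values should come out small, which is consistent with the claim that the choice between ops/sec and sec/op matters little here.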
c133999 to 83c7a88
Ping @addaleax ... what do you want to do with this one?
Support decoding the input of writable streams using a specific encoding before passing it to `_write()`.
By default, all data written to a writable stream is encoded into Buffers. This change enables the reverse situation, i.e. when the stream implementer wants to process all input as strings, whether it was passed to `write()` as a string or not.
This makes sense for multi-byte character encodings, where the buffers written using `write()` may contain partial characters, so calling `chunk.toString()` is not universally applicable.
Fixes: nodejs#7315
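To make the partial-character problem described in this commit message concrete, here is a short sketch that splits a single UTF-8 character across two buffers and decodes them both ways. The variable names are illustrative only.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// '€' is three bytes in UTF-8 (e2 82 ac); split it across two chunks.
const euro = Buffer.from('€', 'utf8');
const parts = [euro.subarray(0, 1), euro.subarray(1)];

// Naive approach: decode every chunk on its own. Each chunk is an
// incomplete byte sequence, so the result contains replacement characters.
const naive = parts.map((part) => part.toString('utf8')).join('');

// StringDecoder buffers incomplete sequences between calls to write().
const decoder = new StringDecoder('utf8');
const decoded =
  parts.map((part) => decoder.write(part)).join('') + decoder.end();

console.log(JSON.stringify(naive));
console.log(decoded); // "€"
```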
66f1091 to c4ea4e6
Kinda still think this should happen, kinda tired of waiting for benchmarks to finish. I’ve rebased this and appreciate reviews… I can try to run the benchmarks sometime soon. But if any @nodejs/collaborators want to help (or even take over), please feel free to do that.
@addaleax IMHO it'd be better to just write a stream benchmark instead of relying on the
@mscdex I had run the stream benchmarks for this, and all that net/http benchmarking stuff was just because it was requested in this thread to make sure. If you think the pre-existing stream benchmarks plus the one included in this PR don’t cover enough ground, it would be cool to get some hints as to how they could be extended.
@addaleax Oh oops, nevermind, I didn't see that you added one already in this PR.
FWIW here's what I get after simplifying the code a bit (see diff below) and increasing the number of runs to 60 and increasing
diff --git a/lib/_stream_writable.js b/lib/_stream_writable.js
index deb6e08..8d786c0 100644
--- a/lib/_stream_writable.js
+++ b/lib/_stream_writable.js
@@ -295,42 +295,40 @@ Writable.prototype.setDefaultEncoding = function setDefaultEncoding(encoding) {
return this;
};
-function decodeChunk(state, chunk, encoding) {
- if (state.objectMode)
- return chunk;
-
- var sd = state.stringDecoder;
- if (typeof chunk === 'string') {
- if (sd !== null && encoding === sd.encoding && sd.lastNeed === 0)
- return chunk; // No re-encoding encessary.
-
- if (state.decodeStrings !== false || sd !== null)
- chunk = Buffer.from(chunk, encoding);
- }
-
- if (sd !== null) {
- // chunk is always a Buffer now.
- if (state.flushing) {
- chunk = sd.end(chunk);
- } else {
- chunk = sd.write(chunk);
- }
+function decodeString(state, str, encoding) {
+ if (!state.objectMode &&
+ state.decodeStrings !== false &&
+ typeof str === 'string' &&
+ (!state.stringDecoder ||
+ state.stringDecoder.encoding !== encoding ||
+ state.stringDecoder.lastNeed)) {
+ str = Buffer.from(str, encoding);
+ if (state.stringDecoder)
+ str = decodeBuffer(state, str);
}
+ return str;
+}
- return chunk;
+function decodeBuffer(state, buf) {
+ if (state.flushing)
+ return state.stringDecoder.end(buf);
+ else
+ return state.stringDecoder.write(buf);
}
// if we're already writing something, then just put this
// in the queue, and wait our turn. Otherwise, call _write
// If we return false, then we need a drain event, so set that flag.
function writeOrBuffer(stream, state, isBuf, chunk, encoding, cb) {
- var sd = state.stringDecoder;
- if (!isBuf || sd) {
- chunk = decodeChunk(state, chunk, encoding);
+ if (!isBuf) {
+ chunk = decodeString(state, chunk, encoding);
if (chunk instanceof Buffer)
encoding = 'buffer';
- else if (sd)
- encoding = sd.encoding;
+ else if (state.stringDecoder)
+ encoding = state.stringDecoder.encoding;
+ } else if (state.stringDecoder && !state.objectMode) {
+ chunk = decodeBuffer(state, chunk);
+ encoding = state.stringDecoder.encoding;
}
var len = state.objectMode ? 1 : chunk.length;
The reason I made the changes is that they seemed to perform closer to, or better than, current master, at least when running the benchmarks and comparing results visually.
Closing due to lack of any visible forward progress. We can reopen if necessary.
Checklist
make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
Affected core subsystem(s)
stream
Description of change
See #7315 for motivation.
Support decoding the input of writable streams using a specific encoding before passing it to `_write()`.
By default, all data written to a writable stream is encoded into Buffers. This change enables the reverse situation, i.e. when the stream implementer wants to process all input as strings, whether it was passed to `write()` as a string or not.
This makes sense for multi-byte character encodings, where the buffers written using `write()` may contain partial characters, so calling `chunk.toString()` is not universally applicable.