Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: disallow two-byte characters in URL path #16237

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
http: disallow two-byte characters in URL path
This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/\uFF2e"})
```

would request `http://example.com/.` (`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion. With this
change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and
control characters already did.
  • Loading branch information
bennofs committed Nov 22, 2017
commit 844a470fbb7ee02b21a0b38abcae37eb9d3e0234
16 changes: 8 additions & 8 deletions lib/_http_client.js
Original file line number Diff line number Diff line change
Expand Up @@ -52,20 +52,20 @@ const errors = require('internal/errors');
// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
function isInvalidPath(s) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above this function needs an update as well

var i = 0;
if (s.charCodeAt(0) <= 32) return true;
if (s.charCodeAt(0) <= 32 || s.charCodeAt(0) > 0xFF) return true;
Copy link
Contributor

@mscdex mscdex Oct 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be > 127 instead of > 255 ? UTF-8 starts using multiple bytes after the last ASCII character (127). Or am I misunderstanding the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V8 strings are not UTF-8, but UTF-16 (UCS2). If it was UTF-8, this would not be necessary, as each byte in an UTF-8 multibyte sequence is > 127 so it would be impossible to get ascii characters that way, even if some bytes were stripped.

What happens is that the string ' \uff2e' is represented in node as (uint16_t*){0xff2e} (where {} is used here to denote an array). With the latin1 encoding (that's what gets used for headers), node internally calls WriteOneByte on that string to write it to the socket and WriteOneByte will strip the higher byte of the two-byte character, so we end up with 0x2e. It works fine for all characters that can be represented with a single byte (and in UCS2, that means all characters below 0x100) (whether any non-ascii characters make sense in this context is questionable, but it also doesn't break anything so I opted not to filter that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latin1 encoding (that's what gets used for headers)

Unfortunately, it's not that simple. See #13296 (comment) for an explanation. The two word summary is "it depends" (on the encoding of the body, but that doesn't fit in two words.)

Aside: calling .charCodeAt() twice for the same index is somewhat wasteful, it would be better to cache the result in a variable. (V8's TurboFan compiler can inline the calls but I don't believe it will fold them.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for pointing me to that issue. I thought a bit about something like that while writing the patch, but I couldn't find out where the encoding is actually set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least as long as we aren't using chunked encoding (no idea what that is :))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, missed your replies. I don't know if your exact proposal would work but take a look at the _storeHeader() and _send() methods in lib/_http_outgoing.js, that's where the headers are put on the wire and their encoding is determined.

You may also want to look at the _storeHeader() calls in lib/_http_client.js, they compose the status line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts? I am not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time browsing through open issues yesterday and there's already a bug logged against this behaviour. (Would need to go back to find it again.) IMO what mscdex said above makes sense, just cap it at 127 instead. The current behaviour is completely unpredictable and broken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the issue I was mentioning #13296 — there's some good discussion there too.

I would prefer what I said above, at least for 9.x (maybe different solution for LTS), but I'll defer to @bnoordhuis and others with more experience with the http module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts?

It's in fairly widespread use. Rejecting it outright would almost certainly result in quite a bit of fallout and lots of frustrated users. Rejecting malformed UTF-8 should be acceptable and appropriate, though.

(But as discussed, what to accept and what to reject depends on the encoding.)

if (++i >= s.length) return false;
if (s.charCodeAt(1) <= 32) return true;
if (s.charCodeAt(1) <= 32 || s.charCodeAt(1) > 0xFF) return true;
if (++i >= s.length) return false;
if (s.charCodeAt(2) <= 32) return true;
if (s.charCodeAt(2) <= 32 || s.charCodeAt(2) > 0xFF) return true;
if (++i >= s.length) return false;
if (s.charCodeAt(3) <= 32) return true;
if (s.charCodeAt(3) <= 32 || s.charCodeAt(3) > 0xFF) return true;
if (++i >= s.length) return false;
if (s.charCodeAt(4) <= 32) return true;
if (s.charCodeAt(4) <= 32 || s.charCodeAt(4) > 0xFF) return true;
if (++i >= s.length) return false;
if (s.charCodeAt(5) <= 32) return true;
if (s.charCodeAt(5) <= 32 || s.charCodeAt(5) > 0xFF) return true;
++i;
for (; i < s.length; ++i)
if (s.charCodeAt(i) <= 32) return true;
if (s.charCodeAt(i) <= 32 || s.charCodeAt(i) > 0xFF) return true;
return false;
}

Expand Down Expand Up @@ -121,7 +121,7 @@ function ClientRequest(options, cb) {
if (path.length <= 39) { // Determined experimentally in V8 5.4
invalidPath = isInvalidPath(path);
} else {
invalidPath = /[\u0000-\u0020]/.test(path);
invalidPath = /[\u0000-\u0020\u0100-\uffff]/.test(path);
}
if (invalidPath)
throw new errors.TypeError('ERR_UNESCAPED_CHARACTERS', 'Request path');
Expand Down
33 changes: 33 additions & 0 deletions test/parallel/test-http-client-invalid-path.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
// Copyright Joyent, Inc. and other Node contributors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: copyright is not needed for new tests.

//
// Permission is hereby granted, free of charge, to any person obtaining a
// copy of this software and associated documentation files (the
// "Software"), to deal in the Software without restriction, including
// without limitation the rights to use, copy, modify, merge, publish,
// distribute, sublicense, and/or sell copies of the Software, and to permit
// persons to whom the Software is furnished to do so, subject to the
// following conditions:
//
// The above copyright notice and this permission notice shall be included
// in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
// OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
// MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN
// NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
// DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
// OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
// USE OR OTHER DEALINGS IN THE SOFTWARE.

'use strict';
const common = require('../common');
const http = require('http');

common.expectsError(() => {
http.request({
path: '/thisisinvalid\uffe2'
}).end();
}, {
code: 'ERR_UNESCAPED_CHARACTERS',
type: TypeError
});