Commit ef0230a

jasnell authored and targos committed
url: add fileURLToPathBuffer API
The existing `fileURLToPath()` does not handle the case where the input URL contains percent-encoded characters that are not valid UTF-8 sequences. This can lead to issues, for instance, when the URL is constructed using file names in non-Unicode encodings (like Shift-JIS).

This commit introduces a new API, `fileURLToPathBuffer()`, which returns a `Buffer` representing the path, allowing for accurate conversion of file URLs to paths without attempting to decode the percent-encoded bytes into characters.

PR-URL: #58700
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
Reviewed-By: LiviaMedeiros <livia@cirno.name>
1 parent dc2f23e commit ef0230a
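
To illustrate the failure described in the commit message, here is a minimal sketch, assuming a POSIX-style conversion (the `windows: false` option) and a made-up path under `/tmp`. The percent-encoded bytes are the Shift-JIS name used in this commit's test; they are not valid UTF-8, so the string-returning `fileURLToPath()` is expected to throw instead of returning a path.

```javascript
const { fileURLToPath } = require('node:url');

// Hypothetical file URL: the last segment is percent-encoded Shift-JIS bytes.
const shiftJisUrl = new URL('file:///tmp/%82%A0%82%A2%82%A4');

let errorName = null;
try {
  // The string variant has to decode the bytes as UTF-8, which fails here.
  fileURLToPath(shiftJisUrl, { windows: false });
} catch (err) {
  errorName = err.name;
}

console.log(errorName);
```

This is the same condition the new test asserts on with `throws(..., { name: 'URIError' })`.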

6 files changed: 215 additions and 0 deletions
doc/api/url.md

Lines changed: 20 additions & 0 deletions
@@ -1358,6 +1358,26 @@ new URL('file:///hello world').pathname; // Incorrect: /hello%20world
 fileURLToPath('file:///hello world'); // Correct: /hello world (POSIX)
 ```
 
+### `url.fileURLToPathBuffer(url[, options])`
+
+<!--
+added: REPLACEME
+-->
+
+* `url` {URL | string} The file URL string or URL object to convert to a path.
+* `options` {Object}
+  * `windows` {boolean|undefined} `true` if the `path` should be
+    returned as a Windows file path, `false` for POSIX, and
+    `undefined` for the system default.
+    **Default:** `undefined`.
+* Returns: {Buffer} The fully-resolved platform-specific Node.js file path
+  as a {Buffer}.
+
+Like `url.fileURLToPath(...)` except that instead of returning a string
+representation of the path, a `Buffer` is returned. This conversion is
+helpful when the input URL contains percent-encoded segments that are
+not valid UTF-8 / Unicode sequences.
+
 ### `url.format(URL[, options])`
 
 <!-- YAML
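
A hedged usage sketch for the API documented in this hunk. It assumes a Node.js release that includes this commit; because the `added:` field is still `REPLACEME`, the example feature-detects the function instead of assuming a version. The helper name `toPathBuffer` and the `/tmp` URL are illustrative only.

```javascript
const url = require('node:url');

// Hypothetical helper: returns a Buffer path, or undefined when the runtime
// predates url.fileURLToPathBuffer().
function toPathBuffer(fileUrl) {
  if (typeof url.fileURLToPathBuffer !== 'function') {
    return undefined;
  }
  // `windows: false` forces POSIX-style output regardless of the host OS.
  return url.fileURLToPathBuffer(fileUrl, { windows: false });
}

const pathBuffer = toPathBuffer('file:///tmp/%82%A0%82%A2%82%A4');
if (pathBuffer !== undefined) {
  // The percent-encoded bytes are carried over as raw bytes, not characters.
  console.log(pathBuffer);
}
```

The returned `Buffer` can be handed to `fs` functions that accept a `Buffer` path, which is how this commit's test uses it with `existsSync()`.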

lib/internal/data_url.js

Lines changed: 1 addition & 0 deletions
@@ -349,4 +349,5 @@ function isomorphicDecode(input) {
 
 module.exports = {
   dataURLProcessor,
+  percentDecode,
 };
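
`percentDecode` is exported here so that `lib/internal/url.js` can reuse it. It is an internal helper and cannot be required from user code. As a rough illustration of what byte-level percent-decoding does, the following standalone sketch follows the percent-decode steps of the WHATWG URL Standard. The names `percentDecodeBytes` and `isHexDigit` are hypothetical, and this is not Node.js's own implementation.

```javascript
// Returns true when `byte` is an ASCII hex digit (0-9, A-F, a-f).
function isHexDigit(byte) {
  return (byte >= 0x30 && byte <= 0x39) ||
         (byte >= 0x41 && byte <= 0x46) ||
         (byte >= 0x61 && byte <= 0x66);
}

// Decodes %XX sequences in a byte sequence without interpreting the result
// as text. Malformed sequences such as `%ZZ` are copied through unchanged.
function percentDecodeBytes(input) {
  const output = Buffer.alloc(input.length); // decoding never grows the data
  let length = 0;
  for (let i = 0; i < input.length; i++) {
    const byte = input[i];
    if (byte === 0x25 && i + 2 < input.length &&
        isHexDigit(input[i + 1]) && isHexDigit(input[i + 2])) {
      const hex = String.fromCharCode(input[i + 1], input[i + 2]);
      output[length++] = parseInt(hex, 16);
      i += 2; // skip the two hex digits just consumed
    } else {
      output[length++] = byte;
    }
  }
  return output.subarray(0, length);
}

const decoded = percentDecodeBytes(Buffer.from('/tmp/%82%A0', 'utf8'));
const passthrough = percentDecodeBytes(Buffer.from('%ZZ', 'utf8'));
console.log(decoded, passthrough.toString('latin1'));
```

Working on bytes instead of characters is what lets the result hold sequences such as `0x82 0xA0` that have no valid UTF-8 interpretation.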

lib/internal/url.js

Lines changed: 118 additions & 0 deletions
@@ -29,6 +29,9 @@ const {
   Symbol,
   SymbolIterator,
   SymbolToStringTag,
+  TypedArrayPrototypeGetBuffer,
+  TypedArrayPrototypeGetByteLength,
+  TypedArrayPrototypeGetByteOffset,
   decodeURIComponent,
 } = primordials;
 
@@ -81,13 +84,17 @@ const {
   CHAR_LOWERCASE_Z,
   CHAR_PERCENT,
   CHAR_PLUS,
+  CHAR_COLON,
 } = require('internal/constants');
 const path = require('path');
+const { Buffer } = require('buffer');
 
 const {
   validateFunction,
 } = require('internal/validators');
 
+const { percentDecode } = require('internal/data_url');
+
 const querystring = require('querystring');
 
 const bindingUrl = internalBinding('url');
@@ -1482,6 +1489,76 @@ function getPathFromURLWin32(url) {
   return StringPrototypeSlice(pathname, 1);
 }
 
+function getPathBufferFromURLWin32(url) {
+  const hostname = url.hostname;
+  let pathname = url.pathname;
+  // In the getPathFromURLWin32 variant, we scan the input for backslash (\)
+  // and forward slash (/) characters, specifically looking for the ASCII/UTF8
+  // encoding of these and forbidding their use. This is a bit tricky
+  // because these may conflict with non-UTF8 encodings. For instance,
+  // in Shift-JIS, %5C identifies the symbol for the Japanese Yen and not the
+  // backslash. If we have a url like file:///foo/%5c/bar, then we really have
+  // no way of knowing if that %5c is meant to be a backslash \ or a yen sign.
+  // Passing in an encoding option does not help since our Buffer encoding only
+  // knows about certain specific text encodings and a single file path might
+  // actually contain segments that use multiple encodings. It's tricky! So,
+  // for this variation where we are producing a buffer, we won't scan for the
+  // slashes at all, and instead will decode the bytes literally into the
+  // returned Buffer. That said, that can also be tricky because, on Windows,
+  // the file path separator *is* the ASCII backslash. This is a known issue
+  // on Windows specific to the Shift-JIS encoding that we're not really going
+  // to solve here. Instead, we're going to do the best we can and just
+  // interpret the input url as a sequence of bytes.
+
+  // Because we are converting to a Windows file path here, we need to replace
+  // the explicit forward slash separators with backslashes. Note that this
+  // intentionally disregards any percent-encoded forward slashes in the path.
+  pathname = SideEffectFreeRegExpPrototypeSymbolReplace(FORWARD_SLASH, pathname, '\\');
+
+  // Now, let's start to build our Buffer. We will initially start with a
+  // Buffer allocated to fit in the entire string. Worst case there are no
+  // percent encoded characters and we take the string as is. Any invalid
+  // percent encodings, e.g. `%ZZ`, are ignored and are passed through
+  // literally.
+  const decodedu8 = percentDecode(Buffer.from(pathname, 'utf8'));
+  const decodedPathname = Buffer.from(TypedArrayPrototypeGetBuffer(decodedu8),
+                                      TypedArrayPrototypeGetByteOffset(decodedu8),
+                                      TypedArrayPrototypeGetByteLength(decodedu8));
+  if (hostname !== '') {
+    // If hostname is set, then we have a UNC path
+    // Pass the hostname through domainToUnicode just in case
+    // it is an IDN using punycode encoding. We do not need to worry
+    // about percent encoding because the URL parser will have
+    // already taken care of that for us. Note that this only
+    // causes IDNs with an appropriate `xn--` prefix to be decoded.
+
+    // This is a bit tricky because of the need to convert to a Buffer
+    // followed by concatenation of the results.
+    const prefix = Buffer.from('\\\\', 'ascii');
+    const domain = Buffer.from(domainToUnicode(hostname), 'utf8');
+
+    return Buffer.concat([prefix, domain, decodedPathname]);
+  }
+  // Otherwise, it's a local path that requires a drive letter.
+  // In this case we're only going to pay attention to the second and
+  // third bytes in the decodedPathname. The second byte must be either an
+  // ASCII uppercase letter between 'A' and 'Z' or a lowercase letter between
+  // 'a' and 'z', and the third byte must be an ASCII `:`, or the
+  // operation will fail.
+
+  const letter = decodedPathname[1] | 0x20;
+  const sep = decodedPathname[2];
+
+  if (letter < CHAR_LOWERCASE_A || letter > CHAR_LOWERCASE_Z || // a..z A..Z
+      (sep !== CHAR_COLON)) {
+    throw new ERR_INVALID_FILE_URL_PATH('must be absolute');
+  }
+
+  // Now, we'll just return everything except the first byte of
+  // decodedPathname
+  return decodedPathname.subarray(1);
+}
+
 function getPathFromURLPosix(url) {
   if (url.hostname !== '') {
     throw new ERR_INVALID_FILE_URL_HOST(platform);
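
The drive-letter check in `getPathBufferFromURLWin32` relies on a small bit trick: OR-ing an ASCII letter with `0x20` folds uppercase to lowercase, so a single range comparison covers both cases. Below is a minimal sketch of that check in isolation, assuming the decoded pathname already uses backslash separators. `stripLeadingSeparator` is a hypothetical name, and a plain `TypeError` stands in for the internal `ERR_INVALID_FILE_URL_PATH`.

```javascript
const CHAR_COLON = 0x3a; // ':'
const CHAR_LOWERCASE_A = 0x61; // 'a'
const CHAR_LOWERCASE_Z = 0x7a; // 'z'

// Hypothetical helper: validates `\C:\...` style input and drops the first byte.
function stripLeadingSeparator(decodedPathname) {
  // Byte 0 is the leading separator, byte 1 should be the drive letter,
  // and byte 2 should be the colon.
  const letter = decodedPathname[1] | 0x20;
  const sep = decodedPathname[2];
  if (letter < CHAR_LOWERCASE_A || letter > CHAR_LOWERCASE_Z ||
      sep !== CHAR_COLON) {
    throw new TypeError('File URL path must be absolute');
  }
  return decodedPathname.subarray(1);
}

const drivePath = stripLeadingSeparator(Buffer.from('\\C:\\dir\\file.txt', 'latin1'));
console.log(drivePath.toString('latin1'));
```

Because `subarray()` returns a view, dropping the leading separator does not copy the path bytes.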
@@ -1500,6 +1577,28 @@ function getPathFromURLPosix(url) {
   return decodeURIComponent(pathname);
 }
 
+function getPathBufferFromURLPosix(url) {
+  if (url.hostname !== '') {
+    throw new ERR_INVALID_FILE_URL_HOST(platform);
+  }
+  const pathname = url.pathname;
+
+  // In the getPathFromURLPosix variant, we scan the input for forward slash
+  // (/) characters, specifically looking for its ASCII/UTF8 encoding and
+  // forbidding its use. This is a bit tricky because these may conflict with
+  // non-UTF8 encodings. Passing in an encoding option does not help since our
+  // Buffer encoding only knows about certain specific text encodings and a
+  // single file path might actually contain segments that use multiple
+  // encodings. It's tricky! So, for this variation where we are producing a
+  // buffer, we won't scan for the slashes at all, and instead will decode the
+  // bytes literally into the returned Buffer. We're going to do the best we
+  // can and just interpret the input url as a sequence of bytes.
+  const u8 = percentDecode(Buffer.from(pathname, 'utf8'));
+  return Buffer.from(TypedArrayPrototypeGetBuffer(u8),
+                     TypedArrayPrototypeGetByteOffset(u8),
+                     TypedArrayPrototypeGetByteLength(u8));
+}
+
 function fileURLToPath(path, options = kEmptyObject) {
   const windows = options?.windows;
   if (typeof path === 'string')
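
Both buffer-producing helpers wrap the `Uint8Array` returned by `percentDecode` using `Buffer.from(arrayBuffer, byteOffset, length)`, which creates a view over the same memory instead of copying it. The following short sketch shows that pattern with public property accessors in place of the primordials. The array `u8` is an arbitrary example, not the output of the internal helper.

```javascript
// A Uint8Array that covers only part of its underlying ArrayBuffer.
const backing = new Uint8Array([0x00, 0x2f, 0x82, 0xa0, 0x00]);
const u8 = backing.subarray(1, 4); // bytes at indexes 1..3

// View: shares memory with `u8`. The offset and length must be passed
// explicitly, otherwise the Buffer would span the whole ArrayBuffer.
const view = Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength);

// Copy: Buffer.from(typedArray) duplicates the bytes.
const copy = Buffer.from(u8);

u8[0] = 0x5c; // mutate through the original typed array
console.log(view[0] === 0x5c, copy[0] === 0x2f);
```

Passing the byte offset and length matters whenever the typed array is itself a slice of a larger allocation, as it may be for pooled buffers.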
@@ -1511,6 +1610,24 @@ function fileURLToPath(path, options = kEmptyObject) {
   return (windows ?? isWindows) ? getPathFromURLWin32(path) : getPathFromURLPosix(path);
 }
 
+// An alternative to fileURLToPath that outputs a Buffer
+// instead of a string. The other fileURLToPath does not
+// handle non-UTF8 encoded percent encodings at all, so
+// converting to a Buffer is necessary in cases where the
+// conversion to a string would fail.
+function fileURLToPathBuffer(path, options = kEmptyObject) {
+  const windows = options?.windows;
+  if (typeof path === 'string') {
+    path = new URL(path);
+  } else if (!isURL(path)) {
+    throw new ERR_INVALID_ARG_TYPE('path', ['string', 'URL'], path);
+  }
+  if (path.protocol !== 'file:') {
+    throw new ERR_INVALID_URL_SCHEME('file');
+  }
+  return (windows ?? isWindows) ? getPathBufferFromURLWin32(path) : getPathBufferFromURLPosix(path);
+}
+
 function pathToFileURL(filepath, options = kEmptyObject) {
   const windows = options?.windows ?? isWindows;
   const isUNC = windows && StringPrototypeStartsWith(filepath, '\\\\');
@@ -1571,6 +1688,7 @@ function getURLOrigin(url) {
 
 module.exports = {
   fileURLToPath,
+  fileURLToPathBuffer,
   pathToFileURL,
   toPathIfFileURL,
   installObjectURLMethods,

lib/url.js

Lines changed: 2 additions & 0 deletions
@@ -59,6 +59,7 @@ const {
   domainToASCII,
   domainToUnicode,
   fileURLToPath,
+  fileURLToPathBuffer,
   pathToFileURL: _pathToFileURL,
   urlToHttpOptions,
   unsafeProtocol,
@@ -1041,5 +1042,6 @@ module.exports = {
   // Utilities
   pathToFileURL,
   fileURLToPath,
+  fileURLToPathBuffer,
   urlToHttpOptions,
 };

test/parallel/test-bootstrap-modules.js

Lines changed: 2 additions & 0 deletions
@@ -105,6 +105,8 @@ expected.beforePreExec = new Set([
   'Internal Binding wasm_web_api',
   'NativeModule internal/events/abort_listener',
   'NativeModule internal/modules/typescript',
+  'NativeModule internal/data_url',
+  'NativeModule internal/mime',
 ]);
 
 expected.atRunTime = new Set([
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+'use strict';
+
+const common = require('../common');
+
+// This test does not work on OSX due to the way it handles
+// non-Unicode sequences in file names.
+if (common.isMacOS) {
+  common.skip('Test unsupported on OSX');
+}
+
+// Unfortunately, the test also does not work on Windows
+// because the writeFileSync operation will replace the
+// non-Unicode characters with replacement characters when
+// it normalizes the path.
+if (common.isWindows) {
+  common.skip('Test unsupported on Windows');
+}
+
+const tmpdir = require('../common/tmpdir');
+
+const {
+  existsSync,
+  writeFileSync,
+} = require('node:fs');
+
+const {
+  ok,
+  throws,
+} = require('node:assert');
+
+const {
+  sep,
+} = require('node:path');
+
+tmpdir.refresh();
+
+const {
+  pathToFileURL,
+  fileURLToPath,
+  fileURLToPathBuffer,
+} = require('node:url');
+
+const kShiftJisName = '%82%A0%82%A2%82%A4';
+const kShiftJisBuffer = Buffer.from([0x82, 0xA0, 0x82, 0xA2, 0x82, 0xA4]);
+
+const tmpdirUrl = pathToFileURL(tmpdir.path + sep);
+const testPath = new URL(kShiftJisName, tmpdirUrl);
+
+ok(testPath.pathname.endsWith(`/${kShiftJisName}`));
+
+const tmpdirBuffer = Buffer.from(tmpdir.path + sep, 'utf8');
+const testPathBuffer = Buffer.concat([tmpdirBuffer, kShiftJisBuffer]);
+
+// We can use the Buffer version of the path to create a file and check
+// its existence. But we cannot use the URL version because it contains
+// non-Unicode percent-encoded characters.
+throws(() => writeFileSync(testPath, 'test'), {
+  name: 'URIError',
+});
+
+writeFileSync(testPathBuffer, 'test');
+ok(existsSync(testPathBuffer));
+
+// Using fileURLToPath fails because the URL contains non-Unicode
+// percent-encoded characters.
+throws(() => existsSync(fileURLToPath(testPath)), {
+  name: 'URIError',
+});
+
+// This variation succeeds because the URL is converted to a buffer
+// without trying to interpret the percent-encoded characters.
+ok(existsSync(fileURLToPathBuffer(testPath)));
