fs: decode filenames using UTF-8 in fs.watch #3401

seishun · 2015-10-16T12:25:55Z

Fixes #2088.

cc @bnoordhuis

Fixes #2088.

bnoordhuis · 2015-10-17T15:34:34Z

This is the correct fix for Windows but not necessarily for other platforms.

seishun · 2015-10-17T15:37:55Z

@bnoordhuis UTF-8 is assumed for other platforms in e.g. readdir.

bnoordhuis · 2015-10-17T15:40:58Z

Yes, that's wrong.

seishun · 2015-10-17T15:46:55Z

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong, or wait until we change literally all of the fs API, which will probably never happen because no one sane would want to use buffers for paths?

bnoordhuis · 2015-10-17T16:05:15Z

So what I think we need to do is bite the bullet and move the file name / path name munging into a class that contains platform-specific or file system-specific logic for encoding and decoding them. That's going to be a fair bit of churn but that's better than the halfhearted approach where everything is broken in different ways all the time.

seishun · 2015-10-17T16:56:01Z

a class that contains platform-specific or file system-specific logic for encoding and decoding them

What logic could there be for Unices other than assuming an encoding? And if it assumes UTF-8, then the logic is exactly the same for all platforms.

where everything is broken in different ways all the time

The only thing objectively broken right now is fs.watch. The theoretical breakage of non-UTF8 filenames on Unices is consistent across all of fs APIs as far as I'm aware.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

bnoordhuis · 2015-10-17T17:31:28Z

What logic could there be for Unices other than assuming an encoding?

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

fs.watch() doesn't produce correct results now either but the one-byte encoding means you can at least patch it up post facto.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

Because it's just swapping one bug for another. Maybe it's going to work better now for a majority of users but that's small consolation for the minority whose workflow just broke. I don't feel comfortable signing off on that.

seishun · 2015-10-17T19:59:12Z

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

Shouldn't it be libuv's job to convert that into proper UTF-8?

bnoordhuis · 2015-10-17T20:33:05Z

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

seishun · 2015-10-17T21:04:28Z

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

It does convert on Windows. Why should OS X be treated any differently? I thought libuv was supposed to abstract away platform differences.

Anyway, my point still stands. You can argue that using UTF-8 on OS X is wrong, but that's how all of fs works. I don't see why fs.watch should be an exception (and in an extremely unintuitive way, too). If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8), then it should be fixed globally for all of fs at some point.

but that's small consolation for the minority whose workflow just broke

If their workflow would break from that, then why don't I see their complaints about other fs functions?

@nodejs/collaborators other opinions?

bnoordhuis · 2015-10-17T21:24:47Z

It does convert on Windows.

Windows is special. Libuv converts to UTF-16 so it can use the wide-char OS functions. The ANSI functions simply don't cut it in many cases.

bnoordhuis · 2015-10-17T21:37:11Z

If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8),

It's UTF-8 with a twist, that's why iconv for example calls it utf-8-mac. From here:

HFS Plus (Mac OS Extended) uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed

If you google around for "OS X NFD" or "HFS NFD", you'll find a great many SO questions on the topic.

then it should be fixed globally for all of fs at some point.

That's what I was suggesting!

silverwind · 2015-10-17T22:08:19Z

We had a lengthy discussion about normalization for OSX in #2165, with the gist being that we don't want to normalize to a specific form because of a possible data loss.

As for this issue, I think a class that abstracts away platform inconsistencies and returns UTF-8 in all cases would be prefered and be best done in libuv.

cc: @jorangreef

seishun · 2015-10-18T10:34:07Z

@bnoordhuis from what I gather, it's still valid UTF-8. In which case, I don't see any problem with decoding it as UTF-8.

seishun · 2015-10-18T13:26:32Z

@silverwind

with the gist being that we don't want to normalize to a specific form because of a possible data loss.

Which means we just want to pass filenames as-is, right? Which is exactly what this PR does, just like everywhere else in fs.

silverwind · 2015-10-19T00:05:44Z

Yes, the closest representation to the original file system would be prefered. What's the issue exactly on Unices if we do it like in this PR?

seishun · 2015-10-19T06:21:46Z

I assume this question was directed at @bnoordhuis, because I have no idea.

jorangreef · 2015-10-19T07:18:01Z

@bnoordhuis is right. Although 95% of filesystems use utf-8 as their encoding, one can't assume that every fileysystem is utf-8 or that just because the platform is Windows or Darwin that the encoding is therefore utf-8. Node currently makes this assumption without providing a way to work with filesystems which use other encodings.

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user. If fs.watch converted to utf-8 then this would lose data in some rare edge cases when the encoding is not in fact utf-8. If the utf-8 decoding then replaced some unknown sequences with the replacement character then there would be no way for the user to get back to the actual filename as used by the filesystem.

@bnoordhuis, does the current utf-8 decoding also normalize form or does it just bring in whatever utf-8 form is being used in the data (whether mixed, NFC, NFD, HFS+ NFD etc.)? I thought it would do the latter, and if so then it should be safe to use with HFS+ so long as the decoding process does not also normalize form at the same time (this should never be done).

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

seishun · 2015-10-19T19:40:04Z

What's the likelihood of someone using fs.watch on an exotic filesystem while disregarding the fact that they are completely screwed by the rest of fs? (And discovering the Buffer hack instead of giving up or submitting an issue when fs.watch gave them garbage)

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user.

No, it's not better. It "supports" the likely irrelevant edge case in the worst way possible (basically by passing a "binary string", which should have been part of history for about 3 years now), while being inconsistent with the rest of fs. What kind of a first impression will a new user of Node.js get when they are told that fs.readdir passes correct filenames while fs.watch requires an undocumented hack with a deprecated encoding?

trevnorris · 2015-10-19T20:19:33Z

"And discovering the Buffer hack"
"requires an undocumented hack"

I'm curious what you're referring to?

(basically by passing a "binary string", which should have been part of history for about 3 years now)

Do you mean removed from fs or removed completely?

with a deprecated encoding

I'd completely forgot about that line in the docs. It's incorrect. Thanks for the reminder (#3441)

seishun · 2015-10-20T05:59:34Z

I'm curious what you're referring to?

Using new Buffer(str, 'binary') to convert it back to a byte array.

Do you mean removed from fs or removed completely?

Removed from all APIs.

jorangreef · 2015-10-20T06:41:17Z

@seishun I think you may have skipped over the last part of my comment:

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

seishun · 2015-10-20T07:47:00Z

@jorangreef Yes, it might be a good idea to have in the future, but fs.watch shouldn't remain in its current state until then.

seishun · 2015-10-21T12:17:09Z

@bnoordhuis What's the status on this?

srl295 · 2015-10-23T16:07:27Z

I don't think I have enough context to ±1 this specific change here, but UTF-8-only from libuv or the JS⇄C++ boundary would make sense.

It is fact that not all file systems are in UTF-8.

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong

And thus a breaking change for 1% (the number may actually be higher)

seishun · 2015-10-23T16:18:50Z

And thus a breaking change for 1% (the number may actually be higher)

The number is irrelevant. Even if such users exist, they won't be able to read or write any files, as all of fs assumes UTF-8. I doubt they'd be able to run Node.js scripts at all. fs.watch is the least of their concern, singling it out doesn't make sense.

bnoordhuis · 2015-10-25T22:05:03Z

Taking that step back, I think we all agree that node is buggy, we just disagree on how to fix it. I don't want to derail this PR further so I filed #3519 with a recap of the current situation and a call for proposals.

seishun · 2015-10-27T20:22:37Z

One thing is clear - there will be no breaking changes to fs as the result of #3519. So I don't see any reason why a fix to fs.watch must be postponed until that.

This makes several changes: 1. Allow path/filename to be passed in as a Buffer on fs methods 2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink, fs.readlinkSync and fs.watch. 3. Documentation updates For 1... it's now possible to do: ```js fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { }); ``` For 2... ```js fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { }); fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { }); ``` encoding can also be passed as a string ```js fs.readdir('/fs/foo/bar', 'hex', (err,list) => { }); ``` The default encoding is set to UTF8 so this addresses the discrepency that existed previously between fs.readdir and fs.watch handling filenames differently. Fixes: nodejs#2088 Refs: nodejs#3519 Alternate: nodejs#3401

jasnell · 2016-04-02T03:50:29Z

#5616 landed... which includes decoding filenames in watch as utf8 by default

fs: decode filenames using UTF-8 in fs.watch

6d71679

Fixes #2088.

mscdex added the fs Issues and PRs related to the fs subsystem / file system. label Oct 16, 2015

seishun added the relnotes label Oct 22, 2015

srl295 mentioned this pull request Oct 23, 2015

Investigate FS and other codepage dependencies nodejs/Intl#25

Closed

This was referenced Oct 28, 2015

Node Foundation CTC Meeting 2015-10-28 #3561

Closed

Release proposal: v5.0.0 #3466

Merged

Node Foundation CTC Meeting 2015-11-04 #3660

Closed

rvagg removed the tsc-agenda label Nov 4, 2015

Trott force-pushed the master branch from 1e896a6 to 082cc8d Compare December 27, 2015 02:00

MylesBorins mentioned this pull request Jan 11, 2016

V5.4.1 propose #4626

Merged

seishun mentioned this pull request Mar 9, 2016

fs: add Buffer support in fs methods #5616

Closed

4 tasks

jasnell closed this Apr 2, 2016

snyk-bot mentioned this pull request Aug 10, 2020

[Snyk] Upgrade rollup from 1.30.1 to 1.32.1 - autoclosed turkdevops/node#3

Closed

leonardoadame mentioned this pull request Nov 28, 2023

[Snyk] Fix for 6 vulnerabilities leonardoadame/node#185

Open

aliscco mentioned this pull request Nov 28, 2023

[Snyk] Fix for 5 vulnerabilities aliscco/alisco-node#694

Open

leonardoadame mentioned this pull request Nov 29, 2023

[Snyk] Fix for 6 vulnerabilities leonardoadame/node#210

Open

This was referenced Nov 29, 2023

[Snyk] Fix for 4 vulnerabilities aliscco/alisco-node#714

Open

[Snyk] Fix for 5 vulnerabilities aliscco/alisco-node#732

Open

leonardoadame mentioned this pull request Nov 29, 2023

[Snyk] Fix for 5 vulnerabilities leonardoadame/node#263

Open

This was referenced Nov 29, 2023

[Snyk] Fix for 14 vulnerabilities aliscco/alisco-node#776

Open

[Snyk] Fix for 6 vulnerabilities aliscco/alisco-node#1023

Open

Uh oh!

fs: decode filenames using UTF-8 in fs.watch #3401

fs: decode filenames using UTF-8 in fs.watch #3401

Uh oh!

Conversation

seishun commented Oct 16, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

seishun commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

seishun commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

seishun commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

seishun commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

seishun commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

bnoordhuis commented Oct 17, 2015

Uh oh!

silverwind commented Oct 17, 2015

Uh oh!

seishun commented Oct 18, 2015

Uh oh!

seishun commented Oct 18, 2015

Uh oh!

silverwind commented Oct 19, 2015

Uh oh!

seishun commented Oct 19, 2015

Uh oh!

jorangreef commented Oct 19, 2015

Uh oh!

seishun commented Oct 19, 2015

Uh oh!

trevnorris commented Oct 19, 2015

Uh oh!

seishun commented Oct 20, 2015

Uh oh!

jorangreef commented Oct 20, 2015

Uh oh!

seishun commented Oct 20, 2015

Uh oh!

seishun commented Oct 21, 2015

Uh oh!

srl295 commented Oct 23, 2015

Uh oh!

seishun commented Oct 23, 2015

Uh oh!

bnoordhuis commented Oct 25, 2015

Uh oh!

seishun commented Oct 27, 2015

Uh oh!

jasnell commented Apr 2, 2016

Uh oh!

Uh oh!