-
Notifications
You must be signed in to change notification settings - Fork 29.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fs: decode filenames using UTF-8 in fs.watch #3401
Conversation
This is the correct fix for Windows but not necessarily for other platforms. |
@bnoordhuis UTF-8 is assumed for other platforms in e.g. readdir. |
Yes, that's wrong. |
It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong, or wait until we change literally all of the fs API, which will probably never happen because no one sane would want to use buffers for paths? |
So what I think we need to do is bite the bullet and move the file name / path name munging into a class that contains platform-specific or file system-specific logic for encoding and decoding them. That's going to be a fair bit of churn but that's better than the halfhearted approach where everything is broken in different ways all the time. |
What logic could there be for Unices other than assuming an encoding? And if it assumes UTF-8, then the logic is exactly the same for all platforms.
The only thing objectively broken right now is I don't understand why you insist on making global changes before making |
Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.
Because it's just swapping one bug for another. Maybe it's going to work better now for a majority of users but that's small consolation for the minority whose workflow just broke. I don't feel comfortable signing off on that. |
Shouldn't it be libuv's job to convert that into proper UTF-8? |
It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border. |
It does convert on Windows. Why should OS X be treated any differently? I thought libuv was supposed to abstract away platform differences. Anyway, my point still stands. You can argue that using UTF-8 on OS X is wrong, but that's how all of
If their workflow would break from that, then why don't I see their complaints about other @nodejs/collaborators other opinions? |
Windows is special. Libuv converts to UTF-16 so it can use the wide-char OS functions. The ANSI functions simply don't cut it in many cases. |
It's UTF-8 with a twist, that's why iconv for example calls it
If you google around for "OS X NFD" or "HFS NFD", you'll find a great many SO questions on the topic.
That's what I was suggesting! |
We had a lengthy discussion about normalization for OSX in #2165, with the gist being that we don't want to normalize to a specific form because of a possible data loss. As for this issue, I think a class that abstracts away platform inconsistencies and returns UTF-8 in all cases would be prefered and be best done in libuv. cc: @jorangreef |
@bnoordhuis from what I gather, it's still valid UTF-8. In which case, I don't see any problem with decoding it as UTF-8. |
Which means we just want to pass filenames as-is, right? Which is exactly what this PR does, just like everywhere else in |
Yes, the closest representation to the original file system would be prefered. What's the issue exactly on Unices if we do it like in this PR? |
I assume this question was directed at @bnoordhuis, because I have no idea. |
@bnoordhuis is right. Although 95% of filesystems use utf-8 as their encoding, one can't assume that every fileysystem is utf-8 or that just because the platform is Windows or Darwin that the encoding is therefore utf-8. Node currently makes this assumption without providing a way to work with filesystems which use other encodings. The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user. If fs.watch converted to utf-8 then this would lose data in some rare edge cases when the encoding is not in fact utf-8. If the utf-8 decoding then replaced some unknown sequences with the replacement character then there would be no way for the user to get back to the actual filename as used by the filesystem. @bnoordhuis, does the current utf-8 decoding also normalize form or does it just bring in whatever utf-8 form is being used in the data (whether mixed, NFC, NFD, HFS+ NFD etc.)? I thought it would do the latter, and if so then it should be safe to use with HFS+ so long as the decoding process does not also normalize form at the same time (this should never be done). Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this. |
What's the likelihood of someone using
No, it's not better. It "supports" the likely irrelevant edge case in the worst way possible (basically by passing a "binary string", which should have been part of history for about 3 years now), while being inconsistent with the rest of |
I'm curious what you're referring to?
Do you mean removed from
I'd completely forgot about that line in the docs. It's incorrect. Thanks for the reminder (#3441) |
Using
Removed from all APIs. |
@seishun I think you may have skipped over the last part of my comment:
|
@jorangreef Yes, it might be a good idea to have in the future, but |
@bnoordhuis What's the status on this? |
I don't think I have enough context to ±1 this specific change here, but UTF-8-only from libuv or the JS⇄C++ boundary would make sense. It is fact that not all file systems are in UTF-8.
And thus a breaking change for 1% (the number may actually be higher) |
The number is irrelevant. Even if such users exist, they won't be able to read or write any files, as all of |
Taking that step back, I think we all agree that node is buggy, we just disagree on how to fix it. I don't want to derail this PR further so I filed #3519 with a recap of the current situation and a call for proposals. |
One thing is clear - there will be no breaking changes to |
This makes several changes: 1. Allow path/filename to be passed in as a Buffer on fs methods 2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink, fs.readlinkSync and fs.watch. 3. Documentation updates For 1... it's now possible to do: ```js fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { }); ``` For 2... ```js fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { }); fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { }); ``` encoding can also be passed as a string ```js fs.readdir('/fs/foo/bar', 'hex', (err,list) => { }); ``` The default encoding is set to UTF8 so this addresses the discrepency that existed previously between fs.readdir and fs.watch handling filenames differently. Fixes: nodejs#2088 Refs: nodejs#3519 Alternate: nodejs#3401
#5616 landed... which includes decoding filenames in watch as utf8 by default |
Fixes #2088.
cc @bnoordhuis