Explain that from Notepad++ 8.8.8, ANSI is disabled when Windows is set to Use Unicode UTF-8 for worldwide language support. #841
Feel free to make this more concise if you can figure out how to do that. It feels long and wordy to me, and like it shouldn’t really be a whole sub-sub-sub-section of its own, but nothing else I tried was any better.
donho
left a comment
@pryrt
What's your opinion?
After getting feedback on my two comments above (i.e., assuming you agreed that creating a new Encoding section outside of the preferences page was a good idea), I would hijack this PR (or cancel this one and start my own) to do a bigger rework to create a new section for detailed Encoding documentation. (I've previously created new sections in the manual without "permission", when I thought it was best for the Manual. But since you're involved in this discussion now, I want to make sure we're on the same page before I move forward.)
@pryrt If there’s anything I can do that would help, let me know.
This is not quite correct. I think only @donho knows exactly how this works, so hopefully he will correct any inaccuracies below. I understand that this level of detail is inappropriate for the manual. I just do not know how to be both accurate and user-friendly. This is maddeningly complicated.

First, and regardless of the Autodetect setting, Notepad++ checks for a byte order mark. If there is one, it is taken as definitive and no further questions are asked. (If I’m remembering correctly, even though a file in a legacy 8-bit encoding legitimately can include any sequence of bytes, including one that looks like a byte order mark, there is no way at all to get Notepad++ to interpret a file that begins with a byte order mark as anything but the Unicode format corresponding to that byte order mark. Attempts to change it using the Encoding menu will be ignored. Fortunately this is almost never a practical problem.)

I think the test for encoding defined in XML (and HTML?) files occurs next, also regardless of the Autodetect setting. I am uncertain as to whether the user can override this with an Encoding menu selection. And I think there is some logic to make this use ANSI (and not a character set sub-menu entry) if the character set identified is the system code page... but I don’t know the details.

Then, if and only if Autodetect is checked, there is a heuristic test to see if the file is likely to be one of a number of specific code page encodings. (I do not know the scope of that test, except that based on the discussion Don and I had while he worked on Issue #17057, one of the things it cannot successfully recognize is Windows-1252.)

If all that fails to determine an encoding (once again, regardless of the Autodetect setting), a test is made to see if the file appears to be all ASCII, valid UTF-8 (but not all ASCII), or neither:

- If it is ASCII, then if the system code page is 65001 or if the New Documents setting Apply to opened ANSI files is checked, it is opened as UTF-8; else it is opened as ANSI.
- If it is not pure ASCII and it is valid UTF-8, it is opened as UTF-8.
- If it is not pure ASCII and it is not valid UTF-8, then if the system code page is 65001, it is opened using the entry on the Encoding > Character sets sub-menus for the legacy code page corresponding to the system locale; otherwise it is opened as ANSI.
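For anyone trying to follow the order described above, here is a rough Python sketch of the first step (BOM check) and the last step (ASCII / valid UTF-8 / neither). It is only a model of this comment, not Notepad++'s actual code: the function names, the returned labels, and the use of Python's strict UTF-8 decoder as the "valid UTF-8" test are all assumptions, and the XML/HTML declaration test and the Autodetect heuristic are deliberately left out because their details are uncertain.

```python
import codecs

UTF8_CODE_PAGE = 65001  # Windows code page number for UTF-8


def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a byte string without a BOM."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding; Notepad++'s own validity test may differ (assumption).
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def pick_encoding(data: bytes, system_code_page: int, apply_to_opened_ansi: bool) -> str:
    """Hypothetical model of the detection order described in the comment."""
    # Step 1: a byte order mark is taken as definitive.
    for bom, name in (
        (codecs.BOM_UTF8, "UTF-8-BOM"),
        (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
        (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
    ):
        if data.startswith(bom):
            return name

    # Steps 2 and 3 (XML/HTML declaration, Autodetect heuristic) are omitted:
    # the discussion does not pin down how they work.

    # Final step: ASCII, valid UTF-8, or neither.
    kind = classify(data)
    if kind == "ascii":
        if system_code_page == UTF8_CODE_PAGE or apply_to_opened_ansi:
            return "UTF-8"
        return "ANSI"
    if kind == "utf-8":
        return "UTF-8"
    if system_code_page == UTF8_CODE_PAGE:
        return "Character sets entry for the legacy code page"
    return "ANSI"
```

With a system code page of 1252, `pick_encoding(b"\xe9", 1252, False)` falls through to ANSI, while the same byte with code page 65001 lands on the Character sets entry, which is the behavior change this PR is trying to document.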
Since any explanation of how encoding detection works is bound to make most people’s eyes glaze over (unless you can work some magic that’s beyond me), I wonder if something should be added in the “Encoding and Use Unicode UTF-8 for worldwide language support” section to note that:

Notepad++ can still open files in the legacy encoding for your system’s locale (so-called “ANSI”) when Use Unicode UTF-8 for worldwide language support is enabled; but it will open them using a selection from the Encoding > Character sets sub-menus instead of ANSI.

I suspect it isn’t worth going into the small ways in which that makes a difference; e.g.:

- positions and lengths of highlighted text won’t necessarily correspond to the byte positions and lengths in the file on disk, and the length of the document in the editor will be different (longer) than the length of the file, unless the document contains only ASCII characters;
- it is possible to paste or otherwise enter into the document characters not in the character set, and they will look like they were inserted successfully until the file is saved and then opened again;
- searches will be in UTF-8 rather than ANSI, which affects \x values over 7f and character ranges ([x-y]) that include non-ASCII characters;
- Plugins > Converter > ASCII -> HEX will show UTF-8 bytes rather than the bytes that appear in the file;
- probably other things that haven’t occurred to me.

If there is a good place to say it, it might be worth clarifying that a file opened with any selection from the Encoding > Character sets sub-menus is always converted to UTF-8 on loading and back to the specified character set when saving; it is never edited directly in the selected character set.
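To make those differences concrete, here is a small Python illustration. It uses Python's own `cp1252` and `utf-8` codecs as stand-ins for a legacy code page file and the UTF-8 buffer it is converted to on loading; the sample text and the helper name `fits_character_set` are made up for the example, and nothing here is taken from Notepad++'s code.

```python
text = "café naïve"

file_bytes = text.encode("cp1252")    # what is stored on disk (one byte per character)
editor_bytes = text.encode("utf-8")   # what the editor holds after converting on load

# The buffer is longer than the file as soon as a non-ASCII character appears.
print(len(file_bytes), len(editor_bytes))

# A search for the byte \xe9 ("é" in Windows-1252) matches the file bytes
# but not the UTF-8 buffer, where "é" is the two bytes C3 A9.
print(b"\xe9" in file_bytes, b"\xe9" in editor_bytes)


def fits_character_set(s: str, codec: str = "cp1252") -> bool:
    """Return True if every character of s can be stored in the given codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# "Ω" can sit in a UTF-8 buffer but has no Windows-1252 representation,
# so it cannot survive a save in that character set.
print(fits_character_set("café"), fits_character_set("Ω"))
```

The same reasoning explains the Converter plugin observation: anything that reads the buffer sees the UTF-8 bytes, not the bytes in the file.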
Okay, I moved the "if option" to only apply to the heuristic portion, and made sure my order follows yours, without trying to get too far into the nitty-gritty details. I also made the brief comment about internal representation.
I think the “Encoding Auto-Detection” and “Encoding and Use Unicode UTF-8 for worldwide language support” sections are very good now. Accurate, as far as I can tell, yet readable. One problem with “Encoding During Editing”:
This is true except when the Encoding is ANSI. When a file is recognized as ANSI, it is loaded into Scintilla and edited in that encoding. (Perhaps counter-intuitively, when a file is loaded using an option from the Character sets sub-menus, even if it is the same code page as the system default code page, the file is converted to UTF-8 and loaded that way.) So:
Similar considerations apply when file encoding is changed (i.e., when an Encoding menu Convert option is used, or when an Encoding is selected for a new tab that has never been saved). I’ll let you figure out how to make that comprehensible to normal human beings, as you are clearly better at that than I am.
/me hopes he got it this time
If I may, I’d like to submit an alternative to the first paragraph under “Encoding During Editing.” Use, ignore or synthesize as you think best. For:
consider:

Notepad++ does not always let you edit a document in the same encoding used to store it in its file. Most of the time this is a technicality that won’t matter to you, but it is good to be aware of the details. When the encoding (shown in the Encoding menu and in [status bar](#status-bar) area 5) is ANSI or UTF-8, you are editing the document in the same encoding as the file. In all other cases (UTF-16 or anything from the Character sets sub-menus), you are editing the document as UTF-8, and Notepad++ converts from or to the chosen encoding when opening or saving the file.

(I question “usually” in the original paragraph because in practice, most of the files most people open, if they don’t have the new Windows Unicode option enabled, will be edited without any conversion, since most are going to be either ANSI or UTF-8.)

(I know “it is good to be aware of the details” begs the question, “Why is it good? Why would I care?”; but I think the answer to that is too geeky for general consumption.)

I think the next paragraph, about BOMs, is great as it is.
I removed your second sentence and replaced it with a parenthetical showing when it does matter (plugins like HexEdit plugin get confused by the internal representation, though I didn't call out any plugin by name) -- but I've now switched it to mostly your paragraph.
That’s good. I like it. I mucked up the link for “status bar” and you copied my mistake. It’s in a different file, so I guess it has to be something like |
I should've verified the link before the last commit. Confirmed it's working now. And since I did that, I audited the other links on the new page, because many needed to be updated to point to the preferences page. So that's been fixed, too. I'll probably let things sit, and look it all over again tomorrow. If I don't see anything else, and you have no other comments, I'll probably publish tomorrow.
I think I’ve run out of things to complain about. ;-) Thanks for all your work on this — I think it will be much more helpful to users now than my initial changes would have been. |
content/docs/encoding.md
As of Notepad++ version 8.8.8, the **ANSI** and **Convert to ANSI** entries on the **Encoding** menu are disabled when the Windows setting **Use Unicode UTF-8 for worldwide language support** is enabled. When that setting is in effect, the system default code page, which ordinarily defines “ANSI” in Windows, *is* UTF-8; attempting to treat UTF-8 as an ordinary code page does not work properly, which caused erratic behavior prior to version 8.8.8. Since the traditional concept of “ANSI” has no consistent meaning when that Windows setting is enabled, Notepad++ disables `ANSI` encoding. (But even with that OS option set, Notepad++ can still choose one of the Character Set encodings; it just manually selects that entry, not setting it to "ANSI".)

Some Windows 11 installations are coming with that option turned on by default. If you need to be able to use the **Convert to ANSI** action, and you find it's disabled in Notepad++ v8.8.8 or newer (or if that conversion doesn't behave as expected on older versions of Notepad++), you can verify in **?**-menu's **Debug Info**: it will show `Current ANSI codepage: 65001` if that Windows OS option is on. If you want to chance that Windows OS setting, Microsoft provides multiple paths to that setting, but two of the common ways to find it are:
"chance" in this paragraph should be "change".
no, I was trying to indicate the risk involved in using Windows OS settings... ;-)
Add that beginning with Notepad++ 8.8.8, ANSI is disabled when Windows is set to Use Unicode UTF-8 for worldwide language support. Explain why this was done, what happens when Notepad++ opens what users think of as an “ANSI” file, how to determine the Windows setting from the Debug Info and where to find the setting in Windows.
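As a side note for anyone triaging user reports, the Debug Info check could be scripted roughly as below. This is a sketch under assumptions: the only line format taken from the documentation text is `Current ANSI codepage: 65001`; the function names and the sample strings are hypothetical, and a real Debug Info dump contains many other lines that this code simply ignores.

```python
import re
from typing import Optional

UTF8_CODE_PAGE = 65001  # Windows code page number for UTF-8


def ansi_code_page(debug_info: str) -> Optional[int]:
    """Extract the code page from a 'Current ANSI codepage: N' line, if present."""
    match = re.search(r"Current ANSI codepage\s*:\s*(\d+)", debug_info)
    return int(match.group(1)) if match else None


def utf8_system_option_enabled(debug_info: str) -> bool:
    """True when the pasted Debug Info reports code page 65001."""
    return ansi_code_page(debug_info) == UTF8_CODE_PAGE


# Hypothetical pasted snippets; only the 'Current ANSI codepage' line matters.
sample_on = "Current ANSI codepage: 65001"
sample_off = "Current ANSI codepage: 1252"

print(utf8_system_option_enabled(sample_on))
print(utf8_system_option_enabled(sample_off))
```

If the line is missing entirely, `ansi_code_page` returns `None` and the check reports `False`, so a missing line should be treated as "unknown" rather than "option off".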