@Coises
Contributor

@Coises Coises commented Nov 22, 2025

Add that beginning with Notepad++ 8.8.8, ANSI is disabled when Windows is set to Use Unicode UTF-8 for worldwide language support. Explain why this was done, what happens when Notepad++ opens what users think of as an “ANSI” file, how to determine the Windows setting from the Debug Info and where to find the setting in Windows.

Explain that from Notepad++ 8.8.8, ANSI is disabled when Windows is set to Use Unicode UTF-8 for worldwide language support.
@Coises
Contributor Author

Coises commented Nov 22, 2025

Feel free to make this more concise if you can figure out how to do that. It feels long and wordy to me, and like it shouldn’t really be a whole sub-sub-sub-section of its own, but nothing else I tried was any better.

@donho donho self-assigned this Nov 22, 2025
Member

@donho donho left a comment


@pryrt
What's your opinion?

@pryrt
Contributor

pryrt commented Nov 23, 2025

@pryrt What's your opinion?

After getting feedback from my two comments above (ie, assuming you agreed that creating a new Encoding section outside of the preferences page was a good idea), I would hijack this PR (or cancel this one and start my own) to do a bigger rework to create a new section for detailed Encoding documentation.

(I've previously created new sections in the manual without "permission", when I thought it was best for the Manual. But since you're involved in this discussion now, I want to make sure we're on the same page before I move forward.)

@donho donho assigned pryrt and unassigned donho Nov 23, 2025
@donho
Member

donho commented Nov 23, 2025

@pryrt
OK, it's your PR now.

@Coises
Contributor Author

Coises commented Nov 24, 2025

@pryrt If there’s anything I can do that would help, let me know.

@pryrt
Contributor

pryrt commented Nov 25, 2025

I've split out most of the Encoding docs into the new encoding.md.
I also clarified that the ANSI setting isn't always 8-bit, as per @Coises's side note.

@Coises, please make sure I didn't miss anything that was discussed earlier, or mis-explain things in any of my rewording of your text. Thanks.

@Coises
Contributor Author

Coises commented Nov 25, 2025

@pryrt

If MISC > Autodetect character encoding is enabled, Notepad++ will attempt to algorithmically determine the encoding of the file. If the file you open is encoded in UTF-16 (which always has the BOM character), or in UTF-8 with the BOM, then Notepad++ will use the encoding based on the BOM. If the file is an XML file, then if the encoding is defined in the declaration/prolog, Notepad++ will use that encoding for the file. Failing that, Notepad++ will also analyze some of the byte sequences in the file, and if they match patterns common to UTF-8 or one of the character sets, then Notepad++ will use that encoding.

If autodetection is not enabled, or if autodetection does not yield a positive result, Notepad++ will choose the encoding based on the system locale.

This is not quite correct. I think only @donho knows exactly how this works, so hopefully he will correct any inaccuracies below.

I understand that this level of detail is inappropriate for the manual. I just do not know how to be both accurate and user-friendly. This is maddeningly complicated.

First, and regardless of the Autodetect setting, Notepad++ checks for a byte order mark. If there is one, it is taken as definitive and no further questions are asked. (If I’m remembering correctly, even though a file in a legacy 8-bit encoding legitimately can include any sequence of bytes, including one that looks like a byte order mark, there is no way at all to get Notepad++ to interpret a file that begins with a byte order mark as anything but the Unicode format corresponding to that byte order mark. Attempts to change it using the Encoding menu will be ignored. Fortunately this is almost never a practical problem.)

I think the test for encoding defined in XML (and HTML?) files occurs next, also regardless of the Autodetect setting. I am uncertain as to whether the user can override this with an Encoding menu selection. And I think there is some logic to make this use ANSI (and not a character set sub-menu entry) if the character set identified is the system code page... but I don’t know the details.

Then, if and only if Autodetect is checked, there is a heuristic test to see if the file is likely to be one of a number of specific code page encodings. (I do not know the scope of that test, except that based on the discussion Don and I had while he worked on Issue #17057, one of the things it cannot successfully recognize is Windows-1252.)

If all that fails to determine an encoding — once again, regardless of the Autodetect setting — a test is made to see if the file appears to be all ASCII, valid UTF-8 (but not all ASCII), or neither.

If it is ASCII, then if the system code page is 65001 or if the New Documents setting Apply to opened ANSI files is checked, it is opened as UTF-8; else it is opened as ANSI.

If it is not pure ASCII and it is valid UTF-8, it is opened as UTF-8.

If it is not pure ASCII and it is not valid UTF-8, then if the system code page is 65001, it is opened using the entry on the Encoding > Character sets sub-menus for the legacy code page corresponding to the system locale; otherwise it is opened as ANSI.
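The order described above can be sketched as a small Python function. This is only an illustration of the sequence as I understand it, not Notepad++'s actual code: the function name, the parameter names, the returned labels, and the `heuristic` callback are all hypothetical, the XML check is reduced to a simple regular expression, and the possible special case that maps a declared encoding back to ANSI is left out because its details are unknown.

```python
import codecs
import re

# Byte order marks checked first; the labels are illustrative only.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
]

# Simplified stand-in for the XML declaration test (assumption).
XML_DECL = re.compile(rb"<\?xml[^>]*\bencoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data, system_code_page, autodetect=False,
                   apply_to_opened_ansi=False, heuristic=None):
    """Sketch of the detection order described in this comment."""
    # 1. A byte order mark is definitive, regardless of the Autodetect setting.
    for bom, label in BOMS:
        if data.startswith(bom):
            return label

    # 2. An encoding declared in an XML prolog, also regardless of Autodetect.
    match = XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")

    # 3. The heuristic test runs if and only if Autodetect is checked.
    if autodetect and heuristic is not None:
        guess = heuristic(data)  # hypothetical callback; returns a label or None
        if guess:
            return guess

    # 4. Fallback: all ASCII, valid UTF-8 (but not all ASCII), or neither.
    utf8_system = system_code_page == 65001
    if data.isascii():
        return "UTF-8" if (utf8_system or apply_to_opened_ansi) else "ANSI"
    if is_valid_utf8(data):
        return "UTF-8"
    return "CHARACTER_SET_FOR_LOCALE" if utf8_system else "ANSI"
```

For example, `guess_encoding("café".encode("cp1252"), 1252)` falls through to the last line and yields `"ANSI"`, while the same bytes with `system_code_page=65001` yield the character-set label instead.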

@Coises
Contributor Author

Coises commented Nov 25, 2025

@pryrt

Since any explanation of how encoding detection works is bound to make most people’s eyes glaze over (unless you can work some magic that’s beyond me), I wonder if something should be added in the “Encoding and Use Unicode UTF-8 for worldwide language support” section to note that:

Notepad++ can still open files in the legacy encoding for your system’s locale (so-called “ANSI”) when Use Unicode UTF-8 for worldwide language support is enabled; but it will open them using a selection from the Encoding > Character sets sub-menus instead of ANSI.

I suspect it isn’t worth going into the small ways in which that makes a difference, e.g.:

  • positions and lengths of highlighted text won’t necessarily correspond to the byte positions and lengths in the file on disk, and the length of the document in the editor will be different (longer) than the length of the file, unless the document contains only ASCII characters;

  • it is possible to paste or otherwise enter into the document characters not in the character set, and they will look like they were inserted successfully until the file is saved and then opened again;

  • searches will be in UTF-8 rather than ANSI, which affects \x values over 7f and character ranges ([x-y]) that include non-ASCII characters;

  • Plugins > Converter > ASCII -> HEX will show UTF-8 bytes rather than the bytes that appear in the file;

  • probably other things that haven’t occurred to me.

If there is a good place to say it, it might be worth clarifying that a file opened with any selection from the Encoding > Character sets sub-menus is always converted to UTF-8 on loading and back to the specified character set when saving; it is never edited directly in the selected character set.
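To make the round trip concrete, here is a minimal Python sketch of "convert to UTF-8 on loading, convert back on saving". It assumes Windows-1252 (`cp1252`) as the selected character set, and the helper names are hypothetical. Replacing unrepresentable characters with `?` on save is my assumption for illustration; I have not checked what Notepad++ actually substitutes.

```python
def load_character_set(raw: bytes, codec: str) -> bytes:
    """Convert file bytes in a legacy character set to the UTF-8 editing buffer."""
    return raw.decode(codec).encode("utf-8")


def save_character_set(buffer: bytes, codec: str) -> bytes:
    """Convert the UTF-8 editing buffer back to the legacy character set.

    Characters outside the set become '?' here (an assumption for this sketch).
    """
    return buffer.decode("utf-8").encode(codec, errors="replace")


on_disk = "café".encode("cp1252")                  # bytes as stored in the file
in_editor = load_character_set(on_disk, "cp1252")  # bytes the editor works on

print(len(on_disk), len(in_editor))   # 4 5
print(on_disk.hex(), in_editor.hex()) # 636166e9 636166c3a9

# A character outside the set looks fine in the buffer but is lost on save.
edited = in_editor + " ✓".encode("utf-8")
saved = save_character_set(edited, "cp1252")
reopened = load_character_set(saved, "cp1252").decode("utf-8")
print(reopened)                       # café ?
```

The differing lengths and hex dumps correspond to the position, length, and ASCII -> HEX differences listed above; the last three lines show why an inserted character can appear to survive until the file is saved and reopened.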

@pryrt
Contributor

pryrt commented Nov 25, 2025

@Coises,

Okay, I moved the "if option" to only apply to the heuristic portion, and made sure my order follows yours, without trying to get too far into the nitty-gritty details. I also made the brief comment about internal representation.

@Coises
Contributor Author

Coises commented Nov 25, 2025

@pryrt:

Okay, I moved the "if option" to only apply to the heuristic portion, and made sure my order follows yours, without trying to get too far into the nitty-gritty details. I also made the brief comment about internal representation.

I think the “Encoding Auto-Detection” and “Encoding and Use Unicode UTF-8 for worldwide language support” sections are very good now. Accurate, as far as I can tell, yet readable.

One problem with “Encoding During Editing”:

It should be clarified: when Notepad++ reads the file, it actually converts the file from whatever encoding it is on the disk, and internally uses the UTF-8 encoding when doing editing and searching – it’s just during file-read and file-write that the file’s encoding is utilized.

This is true except when the Encoding is ANSI. When a file is recognized as ANSI, it is loaded into Scintilla and edited in that encoding. (Perhaps counter-intuitively, when a file is loaded using an option from the Character sets sub-menus, even if it is the same code page as the system default code page, the file is converted to UTF-8 and loaded that way.)

So:

  • If encoding detection/selection results in ANSI, the file is loaded, unchanged, into Scintilla and the document is interpreted in the system code page encoding. (As of 8.8.8, this cannot happen when Use Unicode UTF-8 for worldwide language support is enabled. Before 8.8.8 that combination resulted in erratic behavior.)

  • If encoding detection/selection results in UTF-8, the file is loaded, unchanged, into Scintilla and the document is interpreted as UTF-8.

  • If encoding detection results in UTF-8 with BOM, the first three bytes (the BOM) are skipped, the remainder of the file is loaded, unchanged, into Scintilla, and the document is interpreted as UTF-8.

  • In all other cases, the file is converted from the detected or selected encoding to UTF-8, the converted text is loaded into Scintilla, and the document is interpreted as UTF-8.

Similar considerations apply when file encoding is changed (i.e., when an Encoding menu Convert option is used, or when an Encoding is selected for a new tab that has never been saved).
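In case it helps with the rewording, the four cases above can be written as a short Python sketch. The function name, the encoding labels, and the `codec` parameter (a Python codec name standing in for the selected character set or UTF-16 variant) are hypothetical; the sketch only models which bytes end up in the editing buffer and how they are interpreted, not Scintilla's real API.

```python
import codecs


def load_into_editor(raw: bytes, encoding: str, codec: str = ""):
    """Return (buffer_bytes, interpretation) for a detected or selected encoding."""
    if encoding == "ANSI":
        # Loaded unchanged; the buffer is interpreted in the system code page.
        return raw, "system code page"
    if encoding == "UTF-8":
        # Loaded unchanged; the buffer is interpreted as UTF-8.
        return raw, "UTF-8"
    if encoding == "UTF-8-BOM":
        # The three BOM bytes are skipped; the rest is loaded unchanged.
        return raw[len(codecs.BOM_UTF8):], "UTF-8"
    # All other cases (UTF-16, anything from the Character sets sub-menus):
    # convert to UTF-8 first, then load the converted text.
    return raw.decode(codec).encode("utf-8"), "UTF-8"


raw = "é".encode("cp1252")  # one byte on disk: 0xE9
print(load_into_editor(raw, "ANSI"))                      # (b'\xe9', 'system code page')
print(load_into_editor(raw, "Windows-1252", "cp1252"))    # (b'\xc3\xa9', 'UTF-8')
```

The two printed lines show the counter-intuitive part: the same file bytes stay as they are under ANSI but are converted when the matching Character sets entry is used.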

I’ll let you figure out how to make that comprehensible to normal human beings, as you are clearly better at that than I am.

@pryrt
Contributor

pryrt commented Nov 25, 2025

I’ll let you figure out how to make that comprehensible to normal human beings

/me hopes he got it this time

@Coises
Contributor Author

Coises commented Nov 25, 2025

@pryrt:

If I may, I’d like to submit an alternative to the first paragraph under “Encoding During Editing.” Use, ignore or synthesize as you think best. For:

It should be clarified: when Notepad++ reads the file, it usually converts the file from whatever encoding it is on the disk and may use a different encoding internally – it is just during file-read and file-write that the file’s real encoding is utilized. For the internal encoding, it will use the system encoding internally if Notepad++ determines the file encoding is “ANSI”, but not if one of the specific character sets is chosen; if Notepad++ is not set to “ANSI”, it will use UTF-8 encoding internally. (The same is true when you change the encoding from whatever was originally chosen.)

consider:

Notepad++ does not always let you edit a document in the same encoding used to store it in its file. Most of the time this is a technicality that won’t matter to you, but it is good to be aware of the details. When the encoding (shown in the Encoding menu and in [status bar](#status-bar) area 5) is ANSI or UTF-8, you are editing the document in the same encoding as the file. In all other cases (UTF-16 or anything from the Character sets sub-menus), you are editing the document as UTF-8, and Notepad++ converts from or to the chosen encoding when opening or saving the file.

(I question “usually” in the original paragraph because in practice, most of the files most people open, if they don’t have the new Windows Unicode option enabled, will be edited without any conversion, since most are going to be either ANSI or UTF-8.)

(I know “it is good to be aware of the details” begs the question, “Why is it good? Why would I care?”; but I think the answer to that is too geeky for general consumption.)

I think the next paragraph, about BOMs, is great as it is.

@pryrt
Contributor

pryrt commented Nov 25, 2025

I removed your second sentence and replaced it with a parenthetical showing when it does matter (plugins like HexEdit plugin get confused by the internal representation, though I didn't call out any plugin by name) -- but I've now switched it to mostly your paragraph.

@Coises
Contributor Author

Coises commented Nov 26, 2025

@pryrt:

I removed your second sentence and replaced it with a parenthetical showing when it does matter (plugins like HexEdit plugin get confused by the internal representation, though I didn't call out any plugin by name) -- but I've now switched it to mostly your paragraph.

That’s good. I like it.

I mucked up the link for “status bar” and you copied my mistake. It’s in a different file, so I guess it has to be something like ../user-interface/index.html#status-bar — however that gets done in markdown+Hugo. Sorry about that; I wrote without testing.

@pryrt
Contributor

pryrt commented Nov 26, 2025

I should've verified the link before the last commit. Confirmed it's working now.

And since I did that, I audited the other links on the new page, because many needed to update to pointing to the preferences page. So that's been fixed, too.

I'll probably let things sit, and look it all over again tomorrow. If I don't see anything else, and you have no other comments, I'll probably publish tomorrow.

@Coises
Contributor Author

Coises commented Nov 26, 2025

@pryrt:

I'll probably let things sit, and look it all over again tomorrow. If I don't see anything else, and you have no other comments, I'll probably publish tomorrow.

I think I’ve run out of things to complain about. ;-)

Thanks for all your work on this — I think it will be much more helpful to users now than my initial changes would have been.


As of Notepad++ version 8.8.8, the **ANSI** and **Convert to ANSI** entries on the **Encoding** menu are disabled when the Windows setting **Use Unicode UTF-8 for worldwide language support** is enabled. When that setting is in effect, the system default code page, which ordinarily defines “ANSI” in Windows, *is* UTF-8; attempting to treat UTF-8 as an ordinary code page does not work properly, which caused erratic behavior prior to version 8.8.8. Since the traditional concept of “ANSI” has no consistent meaning when that Windows setting is enabled, Notepad++ disables `ANSI` encoding. (But even with that OS option set, Notepad++ can still choose one of the Character Set encodings; it just manually selects that entry, not setting it to "ANSI".)

Some Windows 11 installations are coming with that option turned on by default. If you need to be able to use the **Convert to ANSI** action, and you find it's disabled in Notepad++ v8.8.8 or newer (or if that conversion doesn't behave as expected on older versions of Notepad++), you can verify in **?**-menu's **Debug Info**: it will show `Current ANSI codepage: 65001` if that Windows OS option is on. If you want to chance that Windows OS setting, Microsoft provides multiple paths to that setting, but two of the common ways to find it are:


chance in this para should be change

Contributor


no, I was trying to indicate the risk involved in using Windows OS settings... ;-)

@pryrt pryrt merged commit df7a75f into notepad-plus-plus:master Nov 26, 2025
1 check passed
4 participants