Fix Unicode byte order mark documentation #1911
Merged
Hello, I came across this article and noticed several problems with it: https://learn.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks
The statement "Unicode plain text is a sequence of 16-bit code values" is incorrect. Unicode text can be stored in several encodings, including UTF-8, UTF-16, and UTF-32. Unicode itself assigns characters to numeric code points, which range up to U+10FFFF, so many of them cannot fit into 16 bits.
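To illustrate the point, here is a minimal Python sketch. The choice of U+1F600 as the sample character and the variable names are mine, not taken from the article. It shows a code point above U+FFFF that UTF-16 has to store as two 16-bit code units (a surrogate pair) rather than one.

```python
import struct

# U+1F600 is an arbitrary sample character outside the Basic Multilingual Plane.
ch = "\U0001F600"
code_point = ord(ch)

# Encode without a BOM, then split the UTF-16 bytes into 16-bit code units.
utf16_bytes = ch.encode("utf-16-le")
utf16_units = list(struct.unpack(f"<{len(utf16_bytes) // 2}H", utf16_bytes))

# Byte lengths of the same single character in each encoding.
sizes = {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(hex(code_point), code_point > 0xFFFF)
print([hex(u) for u in utf16_units])
print(sizes)
```

One character yields two UTF-16 code units here, which is why "a sequence of 16-bit code values" describes UTF-16 at best and not Unicode text in general.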
The statement "Microsoft uses UTF-16, little endian byte order." is incorrect. Some legacy Microsoft products such as Visual Studio use Windows-1252 by default. Some legacy Microsoft products use the name "Unicode" to refer to UCS-2, which is similar to UTF-16 but is a fixed-width 16-bit encoding restricted to the Basic Multilingual Plane. Modern Microsoft products such as Visual Studio Code and .NET use UTF-8 by default, and over 98% of websites use UTF-8, so the article should note that UTF-8 is recommended for new applications.
The statement "which informs an application receiving the file that the file is byte-ordered" is nonsense. All bytes in a file are in some order; there is no such thing as a file with unordered bytes. The byte order mark is useful for UTF-16 and UTF-32 because it indicates whether the byte order is little endian or big endian, not whether the file is byte-ordered in general.
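As a rough sketch of what the byte order mark actually conveys, the Python snippet below inspects the first bytes of a buffer and reports the encoding and byte order they imply. The function name `detect_bom` and the table `_BOMS` are hypothetical names used for illustration only; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the same
# two bytes as the UTF-16 LE BOM (FF FE), so the longer marks go first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length) or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom("hi".encode("utf-16")))  # platform-native byte order
print(detect_bom(b"\xfe\xff\x00h\x00i"))
print(detect_bom(b"plain ASCII"))
```

This is a heuristic, not a full detector: UTF-16 LE text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32 LE BOM, and a file without a BOM tells you nothing either way.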
This PR attempts to fix these problems. If further tweaks are required to the text, let me know and I can update the PR.