bpo-42236: os.device_encoding() respects UTF-8 Mode #23119
Conversation
@methane: Would you mind having a look at this change? I'm not sure whether this replacement is correct, and I'm not sure whether it behaves the same on Android. IMO this change makes os.device_encoding(), and so indirectly open(), more consistent with encoding choices in Python. By the way, I deeply reworked the documentation on encodings, especially the locale encoding, the filesystem encoding and error handler (docs.python.org has not been updated yet).
Python/fileutils.c
Outdated
const PyPreConfig *preconfig = &_PyRuntime.preconfig;
if (preconfig->utf8_mode) {
    return PyUnicode_FromString("UTF-8");
}
If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.
Currently in 3.9, PYTHONLEGACYWINDOWSSTDIO is broken separately from UTF-8 mode. Somehow it ends up incorrectly using the process ANSI code page instead of the console input and output code pages:
C:\>chcp 850
Active code page: 850
C:\>set PYTHONLEGACYWINDOWSSTDIO=1
C:\>py -3.9 -c "import sys; print(sys.stdin.encoding)"
cp1252
Note that even for non-legacy operation, with or without UTF-8 mode, the console input and output code pages are still used with os.read and os.write. Overriding the result from os.device_encoding as "UTF-8" loses important information about the correct encoding to use in those cases:
>>> os.write(1, 'αβψ\n'.encode('UTF-8'))
aá?
5
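As an illustration (my own sketch, not code from CPython), the code pages that actually govern console bytes I/O can be queried with ctypes, and encoding with the real output code page is what round-trips as well as the console allows:

import ctypes
import os

kernel32 = ctypes.windll.kernel32  # Windows-only

# The console bytes API is tied to these code pages, regardless of UTF-8 mode.
input_cp = kernel32.GetConsoleCP()         # code page used by os.read() on console input
output_cp = kernel32.GetConsoleOutputCP()  # code page used by os.write() on console output

# Encoding with the real output code page avoids the mojibake shown above;
# 'replace' is used because e.g. cp850 cannot represent every character.
os.write(1, 'αβψ\n'.encode(f'cp{output_cp}', 'replace'))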
I don't think it is a problem. PYTHONLEGACYWINDOWSSTDIO is disabled by default. This option should be enabled only when a program relies on some hack that depends on the legacy stdout behavior, so trying to make PYTHONLEGACYWINDOWSSTDIO perfect would be a waste of time.
In this case, overriding the option means we are removing one combination of options from users. I don't think that is a nice idea. For example, some programs use PYTHONLEGACYWINDOWSSTDIO to redirect stdout to a file via a hack. See here:
If users want to redirect stdout to a file and use UTF-8 in that file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, even though it is a really ugly hack.
I'm wondering what the rationale is for forcing console bytes I/O to use UTF-8. A console in Windows is not an agnostic bytes medium, unlike a new disk file, pipe, or socket. Internally it uses 16-bit characters. The console bytes API uses best-fit code-page translation between native wide-character strings and byte strings. If the console says the input code page is 437 or the output code page is 850, then there is no doubt that bytes I/O (e.g. os.read and os.write) is limited to those code pages.
Because the console bytes API uses a best-fit translation, the result doesn't even necessarily round-trip what the user enters, which can be surprising. For example, with the input and output code pages set to 850:
>>> s = os.read(0, 5).decode('utf-8', 'surrogateescape')
αβψ
Based on visual feedback it looks like this worked, but on inspection we see that the console mapped αβψ to aß? when asked to read it as a byte string.
>>> s
'a\udce1?\r\n'
>>> n = os.write(1, s.encode('utf-8', 'surrogateescape'))
aß?
The situation in POSIX is similar if the terminal uses a legacy encoding, but I don't recall encountering a best-fit translation in POSIX. If the terminal encoding doesn't support a typed or pasted character, it just gets ignored, which is immediately obvious to the user. As far as I know, detecting the terminal encoding is not even possible in POSIX. Anyway, configuring a POSIX terminal with anything but UTF-8 nowadays is rare. The application-level locale might be some legacy setting or "C" / "POSIX", but the terminal emulator is likely to use UTF-8, so overriding the locale to force using UTF-8 is generally an improvement.
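As a rough sketch of the POSIX side (my own illustration, not CPython code): there is no terminal encoding to query, so everything hinges on the application-level LC_CTYPE locale:

import codecs
import locale

# Adopt the user's locale for LC_CTYPE; this is what determines the
# "locale encoding" on POSIX, since the terminal itself cannot be queried.
locale.setlocale(locale.LC_CTYPE, "")
codeset = locale.nl_langinfo(locale.CODESET)   # e.g. 'UTF-8', or 'ANSI_X3.4-1968' for the C locale
print(codecs.lookup(codeset).name)             # normalized codec name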
Since _Py_device_encoding in Windows is a low-level device query, not an application-level locale query, I think it should always return the encoding for a console file, which is the encoding that should be used with os.read or os.write. Also, the supported domain should be expanded beyond file descriptors 0-2. See _get_console_type in winconsoleio.c for an example of using GetNumberOfConsoleInputEvents to determine input vs. output files. The latter should replace isatty in Windows. This enhancement of _Py_device_encoding would also support opening "CONIN$" and "CONOUT$" in legacy mode with the correct default encoding.
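A rough Python sketch of that _get_console_type idea (the real implementation in Modules/_io/winconsoleio.c is C; the helper name here is hypothetical):

import ctypes
import msvcrt

kernel32 = ctypes.windll.kernel32  # Windows-only

def console_type(fd):
    """Return 'input', 'output', or None for a file descriptor."""
    handle = ctypes.c_void_p(msvcrt.get_osfhandle(fd))
    mode = ctypes.c_ulong()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return None  # not a console handle at all
    events = ctypes.c_ulong()
    # GetNumberOfConsoleInputEvents succeeds only for the console input buffer,
    # so it distinguishes input handles from screen (output) buffers.
    if kernel32.GetNumberOfConsoleInputEvents(handle, ctypes.byref(events)):
        return 'input'
    return 'output'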
Please use b.p.o. for such discussion; a PR is the place for code review.
I have already argued against overriding os.device_encoding() on b.p.o.
If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.
I wrote this change with Unix in mind. On Unix, a terminal has no encoding; only the LC_CTYPE locale matters. But the whole point of the UTF-8 Mode is to deliberately ignore the LC_CTYPE locale (the "locale encoding").
I would be perfectly fine with keeping the current behavior on Windows. It's fine to have different behavior depending on the platform.
If users want to redirect stdout to a file and use UTF-8 in that file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, even though it is a really ugly hack.
On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.
@methane: If os.device_encoding(fd) returns None when fd refers to a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?
device_encoding() is an important function since it is tested first by open() when the encoding is omitted! It is tested before using locale.getpreferredencoding(False).
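In other words, when open() is called without encoding=..., the selection is roughly equivalent to this sketch (simplified, not CPython's exact code):

import locale
import os

def default_text_encoding(fd):
    # The device encoding wins when the fd is an attached console/TTY;
    # with this change it is 'UTF-8' when the UTF-8 Mode is enabled.
    encoding = os.device_encoding(fd)
    if encoding is None:
        # Fallback used for regular files, pipes, etc.
        encoding = locale.getpreferredencoding(False)
    return encoding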
First of all, I initially only argued against overriding the PYTHONLEGACYWINDOWSSTDIO option. After that, I also argued against overriding os.device_encoding() on b.p.o.
On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.
@methane: If os.device_encoding(fd) returns None when fd refers to a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?
This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to a file.
Users may want to use UTF-8 Mode to change the default text file encoding used by open(). If we enforce PYTHONLEGACYWINDOWSSTDIO=0, this (ugly) hack will be broken. That's why I argued against overriding PYTHONLEGACYWINDOWSSTDIO.
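To illustrate, here is a rough, hypothetical reconstruction of that kind of hack (the file name is made up; it relies on sys.stdout being backed by a regular fd, i.e. PYTHONLEGACYWINDOWSSTDIO=1 on Windows):

import os
import sys

# sys.stdout's TextIOWrapper was created at startup for the console, so
# os.device_encoding() already decided its encoding. Now redirect the
# underlying fd to a file: writes keep using that encoding, but land in the file.
log_fd = os.open("out.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.dup2(log_fd, sys.stdout.fileno())
os.close(log_fd)

print("this goes to out.log, encoded with whatever stdout's encoding was")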
PYTHONLEGACYWINDOWSFSENCODING=1 disables the UTF-8 Mode:
https://docs.python.org/dev/c-api/init_config.html#c.PyPreConfig.legacy_windows_fs_encoding
Maybe we can do the same for PYTHONLEGACYWINDOWSSTDIO? If PYTHONLEGACYWINDOWSSTDIO=1 is used, turn off the UTF-8 Mode?
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.legacy_windows_stdio
... But anyway, I will leave the Windows implementation unchanged, since you wrote:
I don't think UTF-8 mode should override os.device_encoding() on Windows.
https://bugs.python.org/issue42236#msg380261
:-)
This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to a file.
Since 3.8, the sys.std* files do not have the correct code page in legacy mode. Regardless of how legacy mode is used by projects, what it MUST implement is the legacy behavior of 3.5: sys.stdin for console input should use the console input code page, and sys.stdout and sys.stderr for console output should use the console output code page.
New behavior is also needed to support file descriptors above 2 in _Py_device_encoding in all cases. If it's desired for legacy mode to continue to be wrong in this case for the sake of compatibility, that should be implemented as a higher-level policy in TextIOWrapper, not by unnecessarily limiting the usage of os.device_encoding in a way that's not indicated by the docstring or documentation.
Since 3.8, the sys.std* files do not have the correct code page in legacy mode.
I'm not aware of this issue. Would you mind opening a separate issue at bugs.python.org?
On Unix, the os.device_encoding() function now returns 'UTF-8' rather than the device encoding if the Python UTF-8 Mode is enabled.
Example showing the impact of this change in practice, on the stdout TTY:
Python 3.9 output (old):
Python 3.10 output (new):
IMHO the new behavior is more consistent. If you want the old behavior, you can pass an encoding explicitly.
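As a rough illustration (assuming stdout is a TTY and a non-UTF-8 locale such as LC_ALL=C), the difference can be observed like this:

import os
import sys

# Run with: LC_ALL=C python -X utf8 thisscript.py  (stdout attached to a TTY)
# Before this change, os.device_encoding(1) reported the locale/device encoding
# even in UTF-8 Mode; with this change it reports 'UTF-8'.
print("utf8_mode:", sys.flags.utf8_mode)
print("device encoding of stdout:", os.device_encoding(sys.stdout.fileno()))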
* master:
- bpo-42260: Add _PyInterpreterState_SetConfig() (pythonGH-23158)
- Disable peg generator tests when building with PGO (pythonGH-23141)
- bpo-1635741: _sqlite3 uses PyModule_AddObjectRef() (pythonGH-23148)
- bpo-1635741: Fix PyInit_pyexpat() error handling (pythonGH-22489)
- bpo-42260: Main init modify sys.flags in-place (pythonGH-23150)
- bpo-1635741: Fix ref leak in _PyWarnings_Init() error path (pythonGH-23151)
- bpo-1635741: _ast uses PyModule_AddObjectRef() (pythonGH-23146)
- bpo-1635741: _contextvars uses PyModule_AddType() (pythonGH-23147)
- bpo-42260: Reorganize PyConfig (pythonGH-23149)
- bpo-1635741: Add PyModule_AddObjectRef() function (pythonGH-23122)
- bpo-42236: os.device_encoding() respects UTF-8 Mode (pythonGH-23119)
- bpo-42251: Add gettrace and getprofile to threading (pythonGH-23125)
- Enable signing of nuget.org packages and update to supported timestamp server (pythonGH-23132)
- Fix incorrect links in ast docs (pythonGH-23017)
- Add _PyType_GetModuleByDef (pythonGH-22835)
- Post 3.10.0a2
- bpo-41796: Call _PyAST_Fini() earlier to fix a leak (pythonGH-23131)
- bpo-42249: Fix writing binary Plist files larger than 4 GiB. (pythonGH-23121)
- bpo-40077: Convert mmap.mmap static type to a heap type (pythonGH-23108)
- Python 3.10.0a2
On Unix, the os.device_encoding() function now returns 'UTF-8' rather than the device encoding if the Python UTF-8 Mode is enabled.
https://bugs.python.org/issue42236