-
-
Notifications
You must be signed in to change notification settings - Fork 31.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-29240: PEP 540: Add a new UTF-8 mode #855
Conversation
@Haypo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zooba, @benjaminp and @serhiy-storchaka to be potential reviewers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to see some improved documentation before this is merged, but I think the rest of the change looks okay.
Doc/library/sys.rst
Outdated
@@ -463,6 +468,10 @@ always available. | |||
Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529` | |||
and :func:`_enablelegacywindowsfsencoding` for more information. | |||
|
|||
.. versionchanged:: 3.7 | |||
The UTF-8 mode can now changes the encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either: "UTF-8 mode can now change the encoding" or "UTF-8 mode now changes the encoding", neither of which are very good at explaining what the impact is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the point that specifying both UTF-8 mode and legacy encoding is not supported? Or specifying UTF-8 mode will override the legacy encoding setting?
Doc/using/cmdline.rst
Outdated
@@ -423,6 +424,9 @@ Miscellaneous options | |||
.. versionadded:: 3.6 | |||
The ``-X showalloccount`` option. | |||
|
|||
.. versionchanged:: 3.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"versionadded"
@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors; | |||
#endif | |||
PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding; | |||
|
|||
#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000 | |||
PyAPI_DATA(int) Py_UTF8Mode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an important piece of information? Is it so important that it has to be in the limited API? What would I use this for?
Modules/main.c
Outdated
break; | ||
} | ||
else if (c == 'X') { | ||
if (wcscmp(_PyOS_optarg, L"utf8") == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wcsicmp
? Any need to be case-sensitive here?
Ping @ncoghlan - more changes to startup sequence :) |
FYI I didn't look again at my code. I only recreated this pull request to not loose the code, since my old PR was on the old CPython repository. I still have to update the PEP and post it to python-dev. Questions like "wcsicmp? Any need to be case-sensitive here?" should be asked on the PEP directly, not on the implementation. You should wait until the PEP is accepted before reviewing the implementation. The PEP may still change. |
@Haypo Ah fair enough :) I haven't been paying attention to all the email recently, so I figured it was accepted already. |
The implementation is not complete: the command line arguments and environment variables are not re-decoded from UTF-8 once Py_Main() detects that the user requested the UTF-8 mode (-X utf8 or PYTHONUTF8). But thanks to my refactoring work on Modules/main.c ( bugs.python.org/issue32030 ), fixing this issue should now be much simpler. |
* Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag. * If the LC_CTYPE locale is "C" at startup: enable automatically the UTF-8 mode. * Add _winapi.GetACP(). encodings._alias_mbcs() now calls _winapi.GetACP() to get the ANSI code page * locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode. * Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding in the UTF-8 mode. * Update subprocess._args_from_interpreter_flags() to handle -X utf8 * Skip some tests relying on the current locale if the UTF-8 mode is enabled. * Add test_utf8mode.py. * _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to return also the length (number of wide characters). * pymain_get_global_config() and pymain_set_global_config() now always copy flag values, rather than only copying if the new value is greater than the old value.
I updated the PR for the 4th version of the PEP 540: getpreferredencoding() now returns UTF-8 in the UTF-8 Mode. |
Programs/python.c
Outdated
Py_UTF8Mode = 1; | ||
} | ||
setlocale(LC_CTYPE, ctype); | ||
PyMem_RawFree(ctype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this setlocale() for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line 64 changes the locale to the user LC_CTYPE locale. This line restores the locale to its previous value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't previous value be got by setlocale(LC_CTYPE, NULL)
instead of setlocale(LC_CTYPE, "")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you're right. I fixed it in a new commit.
Fix conflict in Modules/main.c: pymain_set_global_config() and _Py_CheckHashBasedPycsMode.
Mock _winapi.GetACP(), not _bootlocale.getpreferredencoding().
@methane: You approved my PEP. This is implementation, all tests pass on Linux and Windows. While the implementation is not perfect, I propose to merge it right now to be get more time before the Python 3.7 final to test it. Remaining issue: when the UTF-8 mode is enabled manually (-X utf8), the command line arguments and environment variables are decoded from the locale encoding instead of being decoded from UTF-8. The fix should be easy, but I didn't have time yet to implement it. Another less important issue: the code to check if the LC_CTYPE locale is the POSIX locale is not currently shared between locale coercion (PEP 538) and UTF-8 Mode (PEP 540). |
Programs/python.c
Outdated
/* UTF-8 mode */ | ||
char *old_ctype = setlocale(LC_CTYPE, NULL); | ||
if (old_ctype != NULL) { | ||
old_ctype = _PyMem_RawStrdup(old_ctype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, it can be:
if (_Py_LegacyLocaleDetected()) {
Py_UTF8Mode = 1;
_Py_CoerceLegacyLocale();
}
See here
Lines 75 to 77 in 4ae06c5
if (_Py_LegacyLocaleDetected()) { | |
_Py_CoerceLegacyLocale(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"if (_Py_LegacyLocaleDetected()) { Py_UTF8Mode = 1;"
Right. It works as expected, I updated my PR.
I feel adding more code to And when |
It is.
main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric Snow and me are working on making this code better. See for example https://bugs.python.org/issue32030 Currently, Py_Main() (Modules/main.c) and _PyPathConfig_Calculate() (Modules/getpath.c and PC/getpathp.c) are fully implemented with wchar_t*. Parsing the command line options using char* on Unix but wchar_t* would make the code more complex. I expect that many lines of code would have to be duplicate, one version for char*, one version for wchar_t*. For all these reasons, I propose to merge this uncomplete PR and write a different PR for the most complex part, re-encode wchar_t* command line arguments, implement Py_UnixMain() or another even better option? |
Oops, I forgot to push my main.c change. It's now done. |
mode
using -X utf8 or -X utf8=strict command line options.
enabled.
return also the length (number of wide characters).
mode
https://bugs.python.org/issue29240