gh-105156: Deprecate the old Py_UNICODE type in C API #105157

Merged on Jun 1, 2023 (2 commits)

vstinner
Member

@vstinner vstinner commented May 31, 2023

Deprecate the old Py_UNICODE and PY_UNICODE_TYPE types in the C API: use wchar_t instead.

Replace Py_UNICODE with wchar_t in multiple C files.


📚 Documentation preview 📚: https://cpython-previews--105157.org.readthedocs.build/

@vstinner
Member Author

cc @methane

@methane
Member

methane commented May 31, 2023

Sourcegraph results:

It seems two releases are not enough for removing Py_UNICODE. But let's look at it again in two years.

@vstinner
Member Author

It seems two releases are not enough for removing Py_UNICODE. But let's look at it again in two years.

This PR is mostly about deprecation. I prefer to announce the Python release in which these types will be removed: Python 3.15. But we will have to redo this usage study when the types are actually removed.

The deprecation warning should help users find old code that still uses Py_UNICODE, whether by mistake or on purpose.
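
For readers unfamiliar with how a deprecated type alias produces such a warning, here is a minimal sketch in plain C (no Python headers). The names SKETCH_DEPRECATED, legacy_unicode_t and count_units are hypothetical and only stand in for the real macro and types; this is not CPython's actual definition.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical deprecation macro, assumed for illustration only. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_DEPRECATED __attribute__((deprecated))
#elif defined(_MSC_VER)
#  define SKETCH_DEPRECATED __declspec(deprecated)
#else
#  define SKETCH_DEPRECATED
#endif

/* Hypothetical legacy alias: a thin typedef over wchar_t.
 * Declaring it is silent; *using* it is what triggers the warning. */
SKETCH_DEPRECATED typedef wchar_t legacy_unicode_t;

/* Old code would have written:
 *     size_t count_units(const legacy_unicode_t *text);
 * which compiles but emits a deprecation warning.
 * Migrated code spells the underlying type directly. */
size_t count_units(const wchar_t *text)
{
    return wcslen(text);  /* number of wchar_t units before the terminator */
}
```

Because the alias and wchar_t are the same type, the migration is a rename at the source level and does not change the ABI of the function.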

@methane
Member

methane commented May 31, 2023

@vstinner
Member Author

Sourcegraph results: Py_UNICODE

The first result is Py_UNICODE *inp = PyUnicode_AS_UNICODE(in);. This code is already broken in Python 3.12: the function was removed.

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
@vstinner
Member Author

Fix here too. https://github.com/python/cpython/pull/105157/files#file-modules-posixmodule-c-L5653

I planned to write a separate PR for code generated by Argument Clinic. It's now done with: PR #105161.

@vstinner
Member Author

I will wait until the 2 other PRs for this issue are merged, to avoid emitting new compiler warnings.

@arhadthedev
Member

use wchar_t instead.

Can we use char16_t from C11? Docs: https://en.cppreference.com/w/c/string/multibyte/char16_t.

It would avoid 2-vs-4-byte size discrepancy.

@methane
Member

methane commented May 31, 2023

Can we use char16_t from C11? Docs: https://en.cppreference.com/w/c/string/multibyte/char16_t.

It would avoid 2-vs-4-byte size discrepancy.

Where?

Py_UNICODE has been an alias of wchar_t since Python 3.3.
So users should use wchar_t wherever Py_UNICODE was previously required.

Where Py_UNICODE was not required, my recommendation is "use UTF-8 always".

@arhadthedev
Member

Ah, I see now: the parent issue is about removing a thin and therefore unnecessary typedef, not about changing the multibyte machinery for the next major version of CPython.

@arhadthedev
Member

Initially I had the impression that the PEP 393 removal of Py_UNICODE would leave the C API without any wide character type (so we would need to fill the gap with some other wide character type).

Now I see that this would require a PEP before the removal.

@vstinner
Member Author

vstinner commented Jun 1, 2023

Can we use char16_t from C11?

That would be wrong. Python has many C functions, such as PyUnicode_FromWideChar(), which really expect a 16-bit or 32-bit wchar_t.

Initially I've got an impression that the PEP-393 removal of Py_UNICODE leaves the C API without a wide character type at all

There is Py_UCS4 which should be 32-bit and is able to store all Unicode characters.
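
To make the width argument concrete, here is a minimal sketch in plain C (no Python headers). The function encode_utf16 is hypothetical, and uint32_t is used as a stand-in for a 32-bit code point type such as Py_UCS4. It shows that one 32-bit value can hold any code point, while 16-bit units need a surrogate pair above U+FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
 * Encodes one code point into UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not
 * a valid Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;  /* out of range, or a lone surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;  /* 20 bits remain, split 10/10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, U+0041 takes one 16-bit unit, while U+1F600 takes two (0xD83D, 0xDE00); stored as a 32-bit value, both take exactly one element.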

Where Py_UNICODE was not required, my recommendation is "use UTF-8 always".

Right. The PEP 393 implementation first added many functions using Py_UCS4 arrays. This was inefficient since, most of the time, all code points could be stored in Py_UCS1 arrays (4x smaller). Many strings are just ASCII. There are now more memory-efficient structures. I also wrote the private _PyUnicodeWriter API, which changes the internal storage depending on the maximum code point.
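
The idea of choosing the storage from the maximum code point can be sketched in plain C as follows. This is only an illustration under stated assumptions: min_char_width is a hypothetical helper, uint32_t stands in for a 32-bit code point type, and none of this is CPython's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not CPython's implementation.
 * Returns the smallest per-character width in bytes (1, 2, or 4)
 * that can hold every code point in cps[0..n-1]. */
int min_char_width(const uint32_t *cps, size_t n)
{
    uint32_t max_cp = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > max_cp) {
            max_cp = cps[i];
        }
    }
    if (max_cp < 0x100) {
        return 1;   /* every code point fits in an 8-bit element */
    }
    if (max_cp < 0x10000) {
        return 2;   /* every code point fits in a 16-bit element */
    }
    return 4;       /* at least one code point needs a 32-bit element */
}
```

For an ASCII-only string of n characters, this picks width 1, so the buffer needs n bytes instead of the 4n bytes a 32-bit array would take, which is the "4x smaller" figure mentioned above.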

@vstinner vstinner merged commit 8ed705c into python:main Jun 1, 2023
@vstinner vstinner deleted the deprecate_py_unicode branch June 1, 2023 06:56