Commit d51eaee

methane and vstinner authored
PEP 624: Remove Py_UNICODE encoder APIs (#1497)
Co-authored-by: Victor Stinner <vstinner@python.org>
1 parent f1de4f1 commit d51eaee

1 file changed

+299
-0
lines changed

pep-0624.rst

Lines changed: 299 additions & 0 deletions
@@ -0,0 +1,299 @@
PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11


Abstract
========

This PEP proposes to remove the deprecated ``Py_UNICODE`` encoder APIs
in Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   the Unicode object APIs relating to ``Py_UNICODE``. This PEP, by contrast,
   does not concern the Unicode object itself. The two PEPs are split because
   they have different motivations and need separate discussions.


Motivation
==========

In general, removing APIs that have been deprecated for a long time
and have few users is a good idea: it improves the maintainability
of CPython, and it also helps API users and other Python
implementations.


Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.


Inefficient
-----------

All of these APIs are implemented using ``PyUnicode_FromWideChar``:
they first build a temporary Unicode object from the ``Py_UNICODE*``
buffer and then encode it. They are therefore inefficient when the
user already has a Unicode object to encode.


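The round trip described above can be sketched from Python through
``ctypes.pythonapi``, which calls the same C functions with the same
arguments as C code would. This is only an illustration, not CPython's
actual implementation; the helper name ``encode_utf8_from_wchar`` is
hypothetical.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.restype = ctypes.py_object
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]

# PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
api.PyUnicode_AsUTF8String.restype = ctypes.py_object
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]


def encode_utf8_from_wchar(buffer, size):
    """Hypothetical helper mimicking the two steps of a Py_UNICODE encoder."""
    # Step 1: copy the wchar_t buffer into a temporary Unicode object.
    temp = api.PyUnicode_FromWideChar(buffer, size)
    # Step 2: encode that temporary object.
    return api.PyUnicode_AsUTF8String(temp)


# A wchar_t buffer holding "naïve" plus a terminating NUL.
wide = ctypes.create_unicode_buffer("na\u00efve")
from_buffer = encode_utf8_from_wchar(wide, 5)

# With a Unicode object already in hand, step 1 is pure overhead.
direct = api.PyUnicode_AsUTF8String("na\u00efve")
```

Both paths produce the same bytes; the second one skips the temporary
object entirely.
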
Not used widely
---------------

In a search of the top 4000 PyPI packages [1]_, only pyodbc uses
these APIs:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object,
so it is easy to fix. [2]_


Alternative APIs
================

There are alternative APIs that accept ``PyObject *unicode`` instead of
``Py_UNICODE *``. Users can migrate to them.


=========================================  ==========================================
Deprecated API                             Alternative APIs
=========================================  ==========================================
``PyUnicode_Encode()``                     ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``                ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``               ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                 \(2)
``PyUnicode_EncodeUTF8()``                 ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``                ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``                ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``        ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``     ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``              ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``           ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``              \(4)
``PyUnicode_TransformDecimalToASCII()``    \(4)
=========================================  ==========================================

Notes:

(1)
   The ``const char *errors`` parameter is missing.

(2)
   There is no public alternative API, but users can use the generic
   ``PyUnicode_AsEncodedString()`` instead.

(3)
   The ``const char *errors`` and ``int byteorder`` parameters are missing.

(4)
   There is no direct replacement, but ``Py_UNICODE_TODECIMAL``
   can be used instead. CPython itself uses
   ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
   from Unicode to numbers.


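Where notes (1) to (3) apply, the generic ``PyUnicode_AsEncodedString()``
can usually stand in, because it takes both an encoding name and an
error handler. The sketch below calls it from Python through
``ctypes.pythonapi`` purely so that it runs without a build step; C code
would pass the same three arguments. Selecting the byte order through a
codec name such as ``utf-16-le`` is one possible approach, not something
the PEP prescribes.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding,
#                                     const char *errors)
api.PyUnicode_AsEncodedString.restype = ctypes.py_object
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]

# PyObject *PyUnicode_AsUTF16String(PyObject *unicode)
api.PyUnicode_AsUTF16String.restype = ctypes.py_object
api.PyUnicode_AsUTF16String.argtypes = [ctypes.py_object]

text = "caf\u00e9"

# Note (1): PyUnicode_AsASCIIString() has no errors parameter,
# so pick the error handler through the generic API.
ascii_bytes = api.PyUnicode_AsEncodedString(text, b"ascii", b"replace")

# Note (2): no dedicated public UTF-7 encoder, use the codec name.
utf7_bytes = api.PyUnicode_AsEncodedString(text, b"utf-7", b"strict")

# Note (3): PyUnicode_AsUTF16String() always uses native byte order
# and prepends a BOM ...
native_with_bom = api.PyUnicode_AsUTF16String(text)

# ... whereas an explicit codec name fixes the byte order and omits the BOM.
utf16le_bytes = api.PyUnicode_AsEncodedString(text, b"utf-16-le", b"strict")
```
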
Plan
====

Python 3.9
----------

Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has
already been committed [3]_. All other APIs were already marked
``Py_DEPRECATED(3.3)``.

* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Document all APIs as "will be removed in version 3.11".


Python 3.11
-----------

The following APIs are removed:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``


Alternative ideas
=================

Instead of just removing the deprecated APIs, we may be able to reuse
their names with different signatures.


Make some private APIs public
------------------------------

``PyUnicode_EncodeUTF7()`` doesn't have a public alternative API.

Some APIs have public alternatives, but those are missing the
``const char *errors`` or ``int byteorder`` parameters.

We can rename some private APIs and make them public to cover the
missing APIs and parameters.

=============================  ================================
Rename to                      Rename from
=============================  ================================
``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
=============================  ================================

Pros:

* We have a more consistent API set.

Cons:

* We have more public APIs to maintain.
* Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.


Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` (a typedef of ``wchar_t``) with
``Py_UCS4``. Since the builtin codecs support UCS-4, we don't need to
convert a ``Py_UCS4*`` string to a Unicode object.


Pros:

* We have a more consistent API set.
* Users can encode a UCS-4 string in C without creating a Unicode object.

Cons:

* We have more public APIs to maintain.
* Applications which use UTF-8 or UTF-16 cannot use these APIs
  anyway.
* Other Python implementations may not have a builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need
  to keep UCS-4 support only for these APIs.


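For comparison, this is roughly what code holding a UCS-4 buffer has to
do today: build a Unicode object with ``PyUnicode_FromKindAndData()``
and then encode it. The sketch drives the C functions from Python via
``ctypes.pythonapi`` only to stay runnable without a build step; the
constant is redefined locally with the value of the C enum member
``PyUnicode_4BYTE_KIND``.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromKindAndData(int kind, const void *buffer,
#                                     Py_ssize_t size)
api.PyUnicode_FromKindAndData.restype = ctypes.py_object
api.PyUnicode_FromKindAndData.argtypes = [
    ctypes.c_int, ctypes.c_void_p, ctypes.c_ssize_t]

# PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
api.PyUnicode_AsUTF8String.restype = ctypes.py_object
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]

PyUnicode_4BYTE_KIND = 4  # local copy of the value of the C constant

# A UCS-4 buffer: 'H', 'i', and U+1F600 (outside the BMP).
code_points = (ctypes.c_uint32 * 3)(0x48, 0x69, 0x1F600)

# Step 1: the intermediate Unicode object that the idea above would avoid.
temp = api.PyUnicode_FromKindAndData(
    PyUnicode_4BYTE_KIND, code_points, len(code_points))

# Step 2: encode the Unicode object.
encoded = api.PyUnicode_AsUTF8String(temp)
```
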
Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.

Pros:

* We have a more consistent API set.
* Backward compatible.

Cons:

* We have more public APIs to maintain.
* They are inefficient on platforms where ``wchar_t*`` is UTF-16,
  because the built-in codecs support only UCS-1, UCS-2, and UCS-4
  input.


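The UTF-16 concern comes down to surrogate pairs: a 16-bit ``wchar_t``
buffer may need two units for one code point, so it is not valid UCS-2
input and has to be decoded first. A small sketch of the difference
follows; the helper ``utf16_units`` is hypothetical and exists only for
this illustration.

```python
import ctypes

# Width of wchar_t on this platform: 16 bits where it holds UTF-16
# (for example Windows), 32 bits on most Unix-like systems.
WCHAR_BITS = ctypes.sizeof(ctypes.c_wchar) * 8


def utf16_units(text):
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u3042"          # inside the BMP: one unit, also valid UCS-2
astral_char = "\U0001F600"   # outside the BMP: a surrogate pair

print(WCHAR_BITS, utf16_units(bmp_char), utf16_units(astral_char))
```

A single code point such as U+1F600 occupies two UTF-16 units, which is
why such a buffer cannot be handed to a UCS-2 codec as-is.
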
Rejected ideas
==============

Using runtime warning
---------------------

These APIs don't release the GIL for now. Emitting a warning from
such APIs is not safe. See this example.

.. code-block:: c

   PyObject *u = PyList_GET_ITEM(list, i);  // u is a borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still a living reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
filters and other threads may change the ``list``, and ``u`` can be
a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.

Additionally, since we are not changing behavior but removing C APIs,
a runtime ``DeprecationWarning`` might not be helpful for Python
developers. We should warn extension developers instead.


Discussions
===========

* `Plan to remove Py_UNICODE APis except PEP 623
  <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623: <https://bugs.python.org/issue41123>`_


References
==========

.. [1] Source package list chosen from top 4000 PyPI packages.
       (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
       (https://github.com/mkleehammer/pyodbc/pull/792)

.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
       (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
