|
| 1 | +PEP: 624 |
| 2 | +Title: Remove Py_UNICODE encoder APIs |
| 3 | +Author: Inada Naoki <songofacandy@gmail.com> |
| 4 | +Status: Draft |
| 5 | +Type: Standards Track |
| 6 | +Content-Type: text/x-rst |
| 7 | +Created: 06-Jul-2020 |
| 8 | +Python-Version: 3.11 |
| 9 | + |
| 10 | + |
| 11 | +Abstract |
| 12 | +======== |
| 13 | + |
| 14 | +This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11: |
| 15 | + |
| 16 | +* ``PyUnicode_Encode()`` |
| 17 | +* ``PyUnicode_EncodeASCII()`` |
| 18 | +* ``PyUnicode_EncodeLatin1()`` |
| 19 | +* ``PyUnicode_EncodeUTF7()`` |
| 20 | +* ``PyUnicode_EncodeUTF8()`` |
| 21 | +* ``PyUnicode_EncodeUTF16()`` |
| 22 | +* ``PyUnicode_EncodeUTF32()`` |
| 23 | +* ``PyUnicode_EncodeUnicodeEscape()`` |
| 24 | +* ``PyUnicode_EncodeRawUnicodeEscape()`` |
| 25 | +* ``PyUnicode_EncodeCharmap()`` |
| 26 | +* ``PyUnicode_TranslateCharmap()`` |
| 27 | +* ``PyUnicode_EncodeDecimal()`` |
| 28 | +* ``PyUnicode_TransformDecimalToASCII()`` |
| 29 | + |
| 30 | +.. note:: |
| 31 | + |
| 32 | + `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove |
| 33 | + Unicode object APIs relating to ``Py_UNICODE``. This PEP, on the other |
| 34 | + hand, does not concern the Unicode object itself. The two PEPs are split |
| 35 | + because they have different motivations and need separate discussions. |
| 36 | + |
| 37 | + |
| 38 | +Motivation |
| 39 | +========== |
| 40 | + |
| 41 | +In general, removing APIs that have been deprecated for a long time |
| 42 | +and have few users is a good idea: it not only improves the |
| 43 | +maintainability of CPython, but also helps API users and other |
| 44 | +Python implementations. |
| 45 | + |
| 46 | + |
| 47 | +Rationale |
| 48 | +========= |
| 49 | + |
| 50 | +Deprecated since Python 3.3 |
| 51 | +--------------------------- |
| 52 | + |
| 53 | +``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3. |
| 54 | + |
| 55 | + |
| 56 | +Inefficient |
| 57 | +----------- |
| 58 | + |
| 59 | +All of these APIs are implemented using ``PyUnicode_FromWideChar``, |
| 60 | +which builds a temporary Unicode object from the ``Py_UNICODE*`` buffer. |
| 61 | +They are therefore inefficient when the user wants to encode a Unicode object. |
| 62 | + |
| 63 | + |
| 64 | +Not used widely |
| 65 | +--------------- |
| 66 | + |
| 67 | +A search of the top 4000 PyPI packages [1]_ found that only pyodbc |
| 68 | +uses these APIs, specifically: |
| 69 | + |
| 70 | +* ``PyUnicode_EncodeUTF8()`` |
| 71 | +* ``PyUnicode_EncodeUTF16()`` |
| 72 | + |
| 73 | +pyodbc uses these APIs to encode a Unicode object into a bytes object, |
| 74 | +so it is easy to fix. [2]_ |
| 75 | + |
| 76 | + |
| 77 | +Alternative APIs |
| 78 | +================ |
| 79 | + |
| 80 | +There are alternative APIs that accept ``PyObject *unicode`` instead of |
| 81 | +``Py_UNICODE *``. Users can migrate to them. |
| 82 | + |
| 83 | + |
| 84 | +========================================= ========================================== |
| 85 | +Deprecated API                            Alternative APIs |
| 86 | +========================================= ========================================== |
| 87 | +``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()`` |
| 88 | +``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1) |
| 89 | +``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1) |
| 90 | +``PyUnicode_EncodeUTF7()``                \(2) |
| 91 | +``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1) |
| 92 | +``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3) |
| 93 | +``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3) |
| 94 | +``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()`` |
| 95 | +``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()`` |
| 96 | +``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1) |
| 97 | +``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()`` |
| 98 | +``PyUnicode_EncodeDecimal()``             \(4) |
| 99 | +``PyUnicode_TransformDecimalToASCII()``   \(4) |
| 100 | +========================================= ========================================== |
| 101 | + |
| 102 | +Notes: |
| 103 | + |
| 104 | +(1) |
| 105 | + ``const char *errors`` parameter is missing. |
| 106 | + |
| 107 | +(2) |
| 108 | + There is no public alternative API, but users can use the generic |
| 109 | + ``PyUnicode_AsEncodedString()`` instead. |
| 110 | + |
| 111 | +(3) |
| 112 | + ``const char *errors, int byteorder`` parameters are missing. |
| 113 | + |
| 114 | +(4) |
| 115 | + There is no direct replacement, but ``Py_UNICODE_TODECIMAL`` |
| 116 | + can be used instead. CPython itself uses |
| 117 | + ``_PyUnicode_TransformDecimalAndSpaceToASCII`` to convert |
| 118 | + Unicode to numbers. |
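The migration described in notes (1) to (3) can be sketched as follows. This is only an illustration, not part of the proposal: it calls the object-based C APIs through ``ctypes.pythonapi`` so that it runs without compiling an extension, whereas a real C extension would call the same functions directly. The sample string and variable names are assumptions made for the example.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding,
#                                     const char *errors)
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
api.PyUnicode_AsEncodedString.restype = ctypes.py_object

text = "caf\u00e9"  # hypothetical sample input

# Direct replacement for PyUnicode_EncodeUTF8(); it has no errors parameter.
utf8 = api.PyUnicode_AsUTF8String(text)

# No public UTF-7 specific API: use the generic function (note 2).
utf7 = api.PyUnicode_AsEncodedString(text, b"utf-7", b"strict")

# A byte order is selected through the codec name (note 3).
utf16le = api.PyUnicode_AsEncodedString(text, b"utf-16-le", b"strict")

# An error handler can still be passed to the generic function (note 1).
ascii_replaced = api.PyUnicode_AsEncodedString(text, b"ascii", b"replace")
```

In each case the caller passes the Unicode object itself, so no ``Py_UNICODE*`` buffer is needed.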
| 119 | + |
| 120 | + |
| 121 | +Plan |
| 122 | +==== |
| 123 | + |
| 124 | +Python 3.9 |
| 125 | +---------- |
| 126 | + |
| 127 | +Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has |
| 128 | +already been committed [3]_. All other APIs were already marked |
| 129 | +``Py_DEPRECATED(3.3)``. |
| 130 | + |
| 131 | +* ``PyUnicode_EncodeDecimal()`` |
| 132 | +* ``PyUnicode_TransformDecimalToASCII()`` |
| 133 | + |
| 134 | +Document all APIs as "will be removed in version 3.11". |
| 135 | + |
| 136 | + |
| 137 | +Python 3.11 |
| 138 | +----------- |
| 139 | + |
| 140 | +The following APIs are removed: |
| 141 | + |
| 142 | +* ``PyUnicode_Encode()`` |
| 143 | +* ``PyUnicode_EncodeASCII()`` |
| 144 | +* ``PyUnicode_EncodeLatin1()`` |
| 145 | +* ``PyUnicode_EncodeUTF7()`` |
| 146 | +* ``PyUnicode_EncodeUTF8()`` |
| 147 | +* ``PyUnicode_EncodeUTF16()`` |
| 148 | +* ``PyUnicode_EncodeUTF32()`` |
| 149 | +* ``PyUnicode_EncodeUnicodeEscape()`` |
| 150 | +* ``PyUnicode_EncodeRawUnicodeEscape()`` |
| 151 | +* ``PyUnicode_EncodeCharmap()`` |
| 152 | +* ``PyUnicode_TranslateCharmap()`` |
| 153 | +* ``PyUnicode_EncodeDecimal()`` |
| 154 | +* ``PyUnicode_TransformDecimalToASCII()`` |
| 155 | + |
| 156 | + |
| 157 | +Alternative ideas |
| 158 | +================= |
| 159 | + |
| 160 | +Instead of just removing the deprecated APIs, we may be able to reuse their |
| 161 | +names with different signatures. |
| 162 | + |
| 163 | + |
| 164 | +Make some private APIs public |
| 165 | +------------------------------ |
| 166 | + |
| 167 | +``PyUnicode_EncodeUTF7()`` has no public alternative API. |
| 168 | + |
| 169 | +Some other APIs have public alternatives, but those alternatives lack the |
| 170 | +``const char *errors`` or ``int byteorder`` parameters. |
| 171 | + |
| 172 | +We can rename some private APIs and make them public to cover the missing |
| 173 | +APIs and parameters. |
| 174 | + |
| 175 | +============================= ================================ |
| 176 | +Rename to                     Rename from |
| 177 | +============================= ================================ |
| 178 | +``PyUnicode_EncodeASCII()``   ``_PyUnicode_AsASCIIString()`` |
| 179 | +``PyUnicode_EncodeLatin1()``  ``_PyUnicode_AsLatin1String()`` |
| 180 | +``PyUnicode_EncodeUTF7()``    ``_PyUnicode_EncodeUTF7()`` |
| 181 | +``PyUnicode_EncodeUTF8()``    ``_PyUnicode_AsUTF8String()`` |
| 182 | +``PyUnicode_EncodeUTF16()``   ``_PyUnicode_EncodeUTF16()`` |
| 183 | +``PyUnicode_EncodeUTF32()``   ``_PyUnicode_EncodeUTF32()`` |
| 184 | +============================= ================================ |
| 185 | + |
| 186 | +Pros: |
| 187 | + |
| 188 | +* We have a more consistent API set. |
| 189 | + |
| 190 | +Cons: |
| 191 | + |
| 192 | +* We have more public APIs to maintain. |
| 193 | +* Existing public APIs are enough for most use cases, and |
| 194 | + ``PyUnicode_AsEncodedString()`` can be used in other cases. |
| 195 | + |
| 196 | + |
| 197 | +Replace ``Py_UNICODE*`` with ``Py_UCS4*`` |
| 198 | +----------------------------------------- |
| 199 | + |
| 200 | +We can replace ``Py_UNICODE`` (a typedef of ``wchar_t``) with |
| 201 | +``Py_UCS4``. Since the builtin codecs support UCS-4, we don't need to |
| 202 | +convert a ``Py_UCS4*`` string to a Unicode object. |
| 203 | + |
| 204 | + |
| 205 | +Pros: |
| 206 | + |
| 207 | +* We have a more consistent API set. |
| 208 | +* Users can encode a UCS-4 string in C without creating a Unicode object. |
| 209 | + |
| 210 | +Cons: |
| 211 | + |
| 212 | +* We have more public APIs to maintain. |
| 213 | +* Applications that use UTF-8 or UTF-16 cannot use these APIs |
| 214 | + anyway. |
| 215 | +* Other Python implementations may not have a builtin codec for UCS-4. |
| 216 | +* If we change the Unicode internal representation to UTF-8, we would need |
| 217 | + to keep UCS-4 support only for these APIs. |
| 218 | + |
| 219 | + |
| 220 | +Replace ``Py_UNICODE*`` with ``wchar_t*`` |
| 221 | +----------------------------------------- |
| 222 | + |
| 223 | +We can replace ``Py_UNICODE`` with ``wchar_t``. |
| 224 | + |
| 225 | +Pros: |
| 226 | + |
| 227 | +* We have a more consistent API set. |
| 228 | +* Backward compatible. |
| 229 | + |
| 230 | +Cons: |
| 231 | + |
| 232 | +* We have more public APIs to maintain. |
| 233 | +* They are inefficient on platforms where ``wchar_t*`` is UTF-16, |
| 234 | + because the built-in codecs support only UCS-1, UCS-2, and UCS-4 |
| 235 | + input. |
| 236 | + |
| 237 | + |
| 238 | +Rejected ideas |
| 239 | +============== |
| 240 | + |
| 241 | +Using runtime warning |
| 242 | +--------------------- |
| 243 | + |
| 244 | +These APIs do not currently release the GIL, so emitting a warning from |
| 245 | +them is not safe. Consider this example: |
| 246 | + |
| 247 | +.. code-block:: |
| 248 | + |
| 249 | + PyObject *u = PyList_GET_ITEM(list, i); // u is borrowed reference. |
| 250 | + PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u), |
| 251 | + PyUnicode_GET_SIZE(u), NULL); |
| 252 | + // Assumes u is still living reference. |
| 253 | + PyObject *t = PyTuple_Pack(2, u, b); |
| 254 | + Py_DECREF(b); |
| 255 | + return t; |
| 256 | + |
| 257 | +If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning |
| 258 | +filters and other threads may change the ``list``, and ``u`` can be |
| 259 | +a dangling reference after ``PyUnicode_EncodeUTF8()`` returns. |
| 260 | + |
| 261 | +Additionally, since we are not changing behavior but removing C APIs, |
| 262 | +a runtime ``DeprecationWarning`` might not be helpful for Python |
| 263 | +developers. We should warn extension developers instead. |
| 264 | + |
| 265 | + |
| 266 | +Discussions |
| 267 | +=========== |
| 268 | + |
| 269 | +* `Plan to remove Py_UNICODE APIs except PEP 623 |
| 270 | + <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_ |
| 271 | +* `bpo-41123: Remove Py_UNICODE APIs except PEP 623 <https://bugs.python.org/issue41123>`_ |
| 272 | + |
| 273 | + |
| 274 | +References |
| 275 | +========== |
| 276 | + |
| 277 | +.. [1] Source package list chosen from top 4000 PyPI packages. |
| 278 | + (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt) |
| 279 | + |
| 280 | +.. [2] pyodbc -- Don't use PyUnicode_Encode API #792 |
| 281 | + (https://github.com/mkleehammer/pyodbc/pull/792) |
| 282 | + |
| 283 | +.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318) |
| 284 | + (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181) |
| 285 | + |
| 286 | + |
| 287 | +Copyright |
| 288 | +========= |
| 289 | + |
| 290 | +This document has been placed in the public domain. |
| 291 | + |
| 292 | +.. |
| 293 | + Local Variables: |
| 294 | + mode: indented-text |
| 295 | + indent-tabs-mode: nil |
| 296 | + sentence-end-double-space: t |
| 297 | + fill-column: 70 |
| 298 | + coding: utf-8 |
| 299 | + End: |