Description
Feature or enhancement
PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds":
- PyUnicode_1BYTE_KIND (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
- PyUnicode_2BYTE_KIND (UCS-2): BMP, [U+0000; U+ffff] range.
- PyUnicode_4BYTE_KIND (UCS-4): Full Unicode Character Set, [U+0000; U+10ffff] range.
Strings must always use the optimal storage: for example, an ASCII string must be stored as PyUnicode_1BYTE_KIND.
Strings have a flag indicating if the string only contains ASCII characters: [U+0000; U+007f] range. It's used by multiple internal optimizations.
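The storage rule and the ASCII flag described above can be sketched in plain C, without linking against Python. This is only an illustration: `narrowest_kind()` and the `KIND_*` constants are hypothetical names, not CPython API, and the real implementation computes the maximum character in its own way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three PEP 393 kinds. Each value is the
   width in bytes of one character of that kind. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Return the narrowest kind able to store every code point, and set
   *is_ascii to 1 when all of them are in the [U+0000; U+007f] range. */
static int
narrowest_kind(const uint32_t *code_points, size_t len, int *is_ascii)
{
    uint32_t max_cp = 0;
    for (size_t i = 0; i < len; i++) {
        if (code_points[i] > max_cp) {
            max_cp = code_points[i];
        }
    }
    *is_ascii = (max_cp <= 0x7f);
    if (max_cp <= 0xff) {
        return KIND_1BYTE;
    }
    if (max_cp <= 0xffff) {
        return KIND_2BYTE;
    }
    return KIND_4BYTE;
}
```

With this sketch, a string containing only U+00E9 gets the 1-byte kind with the ASCII flag cleared, and a string containing U+20AC needs the 2-byte kind.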
This implementation is not exposed by the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. In other words, it's not possible to write efficient code for PEP 393 using the limited C API.
I propose adding two functions:
- PyUnicode_AsNativeFormat(): export to the native format
- PyUnicode_FromNativeFormat(): import from the native format
These functions are added to the limited C API version 3.14.
Native formats (new constants):
- PyUnicode_NATIVE_ASCII: ASCII string.
- PyUnicode_NATIVE_UCS1: UCS-1 string.
- PyUnicode_NATIVE_UCS2: UCS-2 string.
- PyUnicode_NATIVE_UCS4: UCS-4 string.
- PyUnicode_NATIVE_UTF8: UTF-8 string (CPython implementation detail: only supported for import, not used by export).
Differences with PyUnicode_FromKindAndData():
- Size is a number of bytes. For example, a single UCS-2 character is counted as 2 bytes.
- Add PyUnicode_NATIVE_ASCII and PyUnicode_NATIVE_UTF8 formats.
The PyUnicode_NATIVE_ASCII format allows further optimizations.
PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations that use UTF-8 as their internal storage.
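Because the size is counted in bytes, a consumer has to divide by the character width to get a length in characters. A minimal sketch of that conversion, assuming the constant values listed in this proposal; `native_char_width()` and `native_length()` are hypothetical helpers, not part of the proposed API:

```c
#include <stddef.h>

/* Constants copied from the proposal. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Bytes per character for fixed-width formats.
   Return 0 for UTF-8 (variable width) and for unknown formats. */
static size_t
native_char_width(int native_format)
{
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
        return 1;
    case PyUnicode_NATIVE_UCS2:
        return 2;
    case PyUnicode_NATIVE_UCS4:
        return 4;
    default:
        return 0;
    }
}

/* Convert a size in bytes to a number of characters.
   Return -1 if the format has no fixed width or if the size is not
   a multiple of the character width. */
static ptrdiff_t
native_length(size_t size_in_bytes, int native_format)
{
    size_t width = native_char_width(native_format);
    if (width == 0 || size_in_bytes % width != 0) {
        return -1;
    }
    return (ptrdiff_t)(size_in_bytes / width);
}
```

For example, a single UCS-2 character has a size of 2 bytes and a length of 1, while a UTF-8 buffer has to be decoded to find its length.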
API:
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5
// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);
// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);
See the attached pull request for more details.
I was asked for this feature in order to port the MarkupSafe C extension to the limited C API. Currently, each release requires building around 60 wheel files, which takes 20 minutes: https://pypi.org/project/MarkupSafe/#files
Using the stable ABI would reduce the number of wheel packages and so ease the release process.
See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
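To illustrate what "specialized for the 3 string kinds" could look like on top of an exported buffer, here is a rough sketch that counts characters commonly escaped in HTML. It is not MarkupSafe's code: `count_escapes()` and its helpers are hypothetical, the set of escaped characters is an assumption, and the buffer is passed in directly rather than obtained from PyUnicode_AsNativeFormat().

```c
#include <stddef.h>
#include <stdint.h>

/* Constants copied from the proposal. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4

/* Assumed set of characters to escape: & < > " ' */
static int
needs_escape(uint32_t ch)
{
    return ch == '&' || ch == '<' || ch == '>' || ch == '"' || ch == '\'';
}

/* Generate one counting loop per character width. */
#define DEFINE_COUNT(NAME, TYPE)                      \
    static size_t NAME(const TYPE *s, size_t len)     \
    {                                                 \
        size_t count = 0;                             \
        for (size_t i = 0; i < len; i++) {            \
            count += (size_t)needs_escape(s[i]);      \
        }                                             \
        return count;                                 \
    }

DEFINE_COUNT(count_ucs1, uint8_t)
DEFINE_COUNT(count_ucs2, uint16_t)
DEFINE_COUNT(count_ucs4, uint32_t)

/* Dispatch on the native format; 'size' is a number of bytes.
   Return -1 for formats this sketch does not handle. */
static ptrdiff_t
count_escapes(const void *data, size_t size, int native_format)
{
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
        return (ptrdiff_t)count_ucs1(data, size);
    case PyUnicode_NATIVE_UCS2:
        return (ptrdiff_t)count_ucs2(data, size / 2);
    case PyUnicode_NATIVE_UCS4:
        return (ptrdiff_t)count_ucs4(data, size / 4);
    default:
        return -1;
    }
}
```

The dispatch happens once per string, so each inner loop works on a fixed character width.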