[C API] PEP 756: Add PyUnicode_Export() and PyUnicode_Import() functions

# Feature or enhancement

[PEP 393 – Flexible String Representation](https://peps.python.org/pep-0393/) changed the Unicode implementation in Python 3.3 to use 3 string "kinds":

* `PyUnicode_KIND_1BYTE` (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
* `PyUnicode_KIND_2BYTE` (UCS-2): BMP, [U+0000; U+ffff] range.
* `PyUnicode_KIND_4BYTE` (UCS-4): Full Unicode Character Set, [U+0000; U+10ffff] range.

Strings must always use the optimal storage: for example, an ASCII string must be stored as `PyUnicode_KIND_1BYTE`.

Strings have a flag indicating if the string only contains ASCII characters: [U+0000; U+007f] range. It's used by multiple internal optimizations.
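To make the "optimal storage" rule concrete, here is a minimal sketch that picks the narrowest kind for a sequence of code points and computes the ASCII flag. It is not CPython's implementation: `narrowest_kind()` and the `KIND_*` constants are hypothetical names used only for illustration, and the kind value is assumed to be the number of bytes per character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three PEP 393 kinds.
   Assumption: the value is the number of bytes used per character. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Return the narrowest kind able to store every code point in 'chars'.
   If 'is_ascii' is not NULL, also report whether all code points are
   in the [U+0000; U+007f] range. */
static int
narrowest_kind(const uint32_t *chars, size_t len, int *is_ascii)
{
    uint32_t max_char = 0;
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > max_char) {
            max_char = chars[i];
        }
    }
    if (is_ascii != NULL) {
        *is_ascii = (max_char <= 0x7f);
    }
    if (max_char <= 0xff) {
        return KIND_1BYTE;   /* UCS-1: ASCII and Latin1 */
    }
    if (max_char <= 0xffff) {
        return KIND_2BYTE;   /* UCS-2: BMP */
    }
    return KIND_4BYTE;       /* UCS-4: full Unicode range */
}
```

With this rule, a string containing only `U+00e9` is stored with 1 byte per character but is not flagged as ASCII, while a string containing `U+20ac` needs 2 bytes per character.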

This implementation is not exposed in the limited C API. For example, the `PyUnicode_FromKindAndData()` function is excluded from the stable ABI. Said differently, **it's not possible to write efficient code for PEP 393 using the limited C API.**

---

I propose adding two functions:

* `PyUnicode_AsNativeFormat()`: export to the native format
* `PyUnicode_FromNativeFormat()`: import from the native format

These functions are added to the limited C API version 3.14.

Native formats (new constants):

* `PyUnicode_NATIVE_ASCII`: ASCII string.
* `PyUnicode_NATIVE_UCS1`: UCS-1 string.
* `PyUnicode_NATIVE_UCS2`: UCS-2 string.
* `PyUnicode_NATIVE_UCS4`: UCS-4 string.
* `PyUnicode_NATIVE_UTF8`: UTF-8 string (CPython implementation detail: only supported for import, not used by export).

Differences with `PyUnicode_FromKindAndData()`:

* Size is a number of bytes. For example, a single UCS-2 character is counted as 2 bytes.
* Add `PyUnicode_NATIVE_ASCII` and `PyUnicode_NATIVE_UTF8` formats.

The `PyUnicode_NATIVE_ASCII` format allows further optimizations.

`PyUnicode_NATIVE_UTF8` can be used by PyPy and other Python implementations using UTF-8 as their internal storage.
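Since the size is counted in bytes rather than in characters, a caller has to divide by the character width to get a length. The sketch below shows one way this could look. `native_char_width()` and `native_length()` are hypothetical helpers, not part of the proposal, `ptrdiff_t` stands in for `Py_ssize_t` so the snippet compiles without `Python.h`, and the constants repeat the values proposed in this issue.

```c
#include <stddef.h>

/* Format constants, with the values proposed in this issue. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Hypothetical helper: bytes per character of a fixed-width format.
   Return 0 for UTF-8 (variable width) and for unknown formats. */
static int
native_char_width(int native_format)
{
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
        return 1;
    case PyUnicode_NATIVE_UCS2:
        return 2;
    case PyUnicode_NATIVE_UCS4:
        return 4;
    default:
        return 0;
    }
}

/* Hypothetical helper: convert a size in bytes to a number of characters.
   Return -1 if the format is not fixed-width, or if 'size' is negative
   or not a multiple of the character width. */
static ptrdiff_t
native_length(ptrdiff_t size, int native_format)
{
    int width = native_char_width(native_format);
    if (width == 0 || size < 0 || size % width != 0) {
        return -1;
    }
    return size / width;
}
```

For example, a single UCS-2 character has a size of 2 bytes and a length of 1. UTF-8 is rejected here because its length cannot be derived from the byte size alone.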

---

API:

```c
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);

// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);
```

See the attached pull request for more details.
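As an illustration of how an extension might consume an exported buffer, here is a rough sketch of a MarkupSafe-style scan specialized per format. It deliberately does not call `PyUnicode_AsNativeFormat()`, which is only proposed here. Instead it works on the `(data, size, native_format)` triple that such a call would provide. `needs_escape()` and `count_escapes()` are hypothetical names, and `ptrdiff_t` stands in for `Py_ssize_t` so the snippet compiles without `Python.h`.

```c
#include <stddef.h>
#include <stdint.h>

/* Format constants, with the values proposed in this issue. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Return 1 if 'ch' is one of the characters escaped in HTML. */
static int
needs_escape(uint32_t ch)
{
    return ch == '<' || ch == '>' || ch == '&' || ch == '"' || ch == '\'';
}

/* Count the characters to escape in a native format buffer.
   'size' is a number of bytes. Return -1 on unknown format.
   Assumption: 'data' is suitably aligned for its character type. */
static ptrdiff_t
count_escapes(const void *data, ptrdiff_t size, int native_format)
{
    ptrdiff_t count = 0;
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
    case PyUnicode_NATIVE_UTF8: {
        /* The escaped characters are ASCII, and bytes below 0x80 never
           occur inside a UTF-8 multi-byte sequence, so a byte scan is
           also correct for UTF-8. */
        const uint8_t *p = data;
        for (ptrdiff_t i = 0; i < size; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    case PyUnicode_NATIVE_UCS2: {
        const uint16_t *p = data;
        for (ptrdiff_t i = 0; i < size / 2; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    case PyUnicode_NATIVE_UCS4: {
        const uint32_t *p = data;
        for (ptrdiff_t i = 0; i < size / 4; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    default:
        return -1;
    }
}
```

The point of the dispatch is that each loop reads the buffer at its native width, without first converting the whole string to UTF-8 or UCS-4.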

---

I was asked for this feature in order to port the MarkupSafe C extension to the limited C API. Currently, each MarkupSafe release requires building around 60 wheel files, which takes 20 minutes: https://pypi.org/project/MarkupSafe/#files

Using the stable ABI would reduce the number of wheel packages and so ease their release process.

See [src/markupsafe/_speedups.c](https://github.com/pallets/markupsafe/blob/main/src/markupsafe/_speedups.c): string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).


### Linked PRs
* gh-119610
* gh-123738
