[C API] PEP 756: Add PyUnicode_Export() and PyUnicode_Import() functions #119609

Closed as not planned
@vstinner

Description

Feature or enhancement

PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds":

  • PyUnicode_KIND_1BYTE (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
  • PyUnicode_KIND_2BYTE (UCS-2): BMP, [U+0000; U+ffff] range.
  • PyUnicode_KIND_4BYTE (UCS-4): Full Unicode Character Set, [U+0000; U+10ffff] range.

Strings must always use the optimal storage: an ASCII string must be stored as PyUnicode_KIND_1BYTE.

Strings have a flag indicating if the string only contains ASCII characters: [U+0000; U+007f] range. It's used by multiple internal optimizations.
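The "optimal storage" rule can be illustrated with a small sketch that picks the narrowest kind from the largest code point in a string and computes the ASCII flag at the same time. The names `KIND_1BYTE`, `KIND_2BYTE`, `KIND_4BYTE` and `narrowest_kind()` are hypothetical stand-ins written for this illustration; they are not CPython's own code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three kinds: bytes per character. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Return the narrowest kind able to store every code point in cps[0..n).
 * Also set *is_ascii to 1 if all code points are in [U+0000; U+007f]. */
int narrowest_kind(const uint32_t *cps, size_t n, int *is_ascii)
{
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > max) {
            max = cps[i];
        }
    }
    *is_ascii = (max <= 0x7f);
    if (max <= 0xff) {
        return KIND_1BYTE;   /* ASCII and Latin1 */
    }
    if (max <= 0xffff) {
        return KIND_2BYTE;   /* BMP */
    }
    return KIND_4BYTE;       /* full Unicode range */
}
```

Under this sketch, a string containing only "A" and "é" (U+00e9) gets the 1-byte kind with the ASCII flag cleared, while a single "€" (U+20ac) forces the 2-byte kind.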

This implementation is not exposed by the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. In other words, it's not possible to write efficient code for PEP 393 using the limited C API.


I propose adding two functions:

  • PyUnicode_AsNativeFormat(): export to the native format
  • PyUnicode_FromNativeFormat(): import from the native format

These functions are added to the limited C API version 3.14.

Native formats (new constants):

  • PyUnicode_NATIVE_ASCII: ASCII string.
  • PyUnicode_NATIVE_UCS1: UCS-1 string.
  • PyUnicode_NATIVE_UCS2: UCS-2 string.
  • PyUnicode_NATIVE_UCS4: UCS-4 string.
  • PyUnicode_NATIVE_UTF8: UTF-8 string (CPython implementation detail: only supported for import, not used by export).

Differences with PyUnicode_FromKindAndData():

  • Size is a number of bytes. For example, a single UCS-2 character is counted as 2 bytes.
  • Add PyUnicode_NATIVE_ASCII and PyUnicode_NATIVE_UTF8 formats.

PyUnicode_NATIVE_ASCII format allows further optimizations.

PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations using UTF-8 as the internal storage.
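Because the size is a number of bytes, a caller that needs the length in characters has to divide by the character width of the format. The following is a minimal sketch of that conversion. The constant values are copied from the proposal's API listing, `native_char_count()` is a hypothetical helper, and `ptrdiff_t` stands in for `Py_ssize_t` so that the block compiles without `Python.h`.

```c
#include <stddef.h>

/* Values copied from the proposal; not part of any released CPython. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Hypothetical helper: convert a size in bytes to a length in characters.
 * Return -1 if the length cannot be derived from the size alone
 * (UTF-8 is variable-width), if the format is unknown, or if the size
 * is not a multiple of the character width. */
ptrdiff_t native_char_count(ptrdiff_t size, int native_format)
{
    ptrdiff_t width;
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
        width = 1;
        break;
    case PyUnicode_NATIVE_UCS2:
        width = 2;
        break;
    case PyUnicode_NATIVE_UCS4:
        width = 4;
        break;
    default:
        return -1;
    }
    if (size < 0 || size % width != 0) {
        return -1;
    }
    return size / width;
}
```

For instance, a UCS-2 buffer of 2 bytes holds one character, and a UCS-4 buffer of 12 bytes holds three.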


API:

#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);

// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);

See the attached pull request for more details.
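As an illustration of how a consumer might use the exported buffer, here is a sketch that dispatches on the native format to count the characters an HTML escaper would rewrite. It does not call the proposed functions, since they are not available in any released CPython; it only takes the `(data, size, native_format)` triple that PyUnicode_AsNativeFormat() is described as producing. `needs_escape()` and `count_escapes()` are hypothetical names, the constant values are copied from the listing above, and the buffer is assumed to be suitably aligned for its character width.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the proposal; not part of any released CPython. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Hypothetical helper: characters that HTML escaping replaces. */
static int needs_escape(uint32_t ch)
{
    return ch == '<' || ch == '>' || ch == '&' || ch == '"' || ch == '\'';
}

/* Count characters to escape in a native-format buffer.
 * 'size' is in bytes, as in the proposal. Return -1 on unknown format. */
ptrdiff_t count_escapes(const void *data, ptrdiff_t size, int native_format)
{
    ptrdiff_t count = 0;
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
    case PyUnicode_NATIVE_UTF8: {
        /* Byte scan. Valid for UTF-8 too: the escaped characters are all
         * ASCII, and bytes of multi-byte sequences are always >= 0x80. */
        const uint8_t *p = data;
        for (ptrdiff_t i = 0; i < size; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    case PyUnicode_NATIVE_UCS2: {
        const uint16_t *p = data;
        for (ptrdiff_t i = 0; i < size / 2; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    case PyUnicode_NATIVE_UCS4: {
        const uint32_t *p = data;
        for (ptrdiff_t i = 0; i < size / 4; i++) {
            count += needs_escape(p[i]);
        }
        return count;
    }
    default:
        return -1;
    }
}
```

The ASCII, UCS-1 and UTF-8 cases share one loop, so a consumer written this way needs three specializations rather than five.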


I was asked for this feature in order to port the MarkupSafe C extension to the limited C API. Currently, each release requires producing around 60 wheel files, which take 20 minutes to build: https://pypi.org/project/MarkupSafe/#files

Using the stable ABI would reduce the number of wheel packages and so ease their release process.

See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
