Description
I propose changing the PyUnicode_AsUTF8() API to raise an exception and return NULL if the string contains embedded null characters.
If the string contains an embedded null character, the UTF-8 encoded string can be truncated when it is passed to C functions that take a char*,
since they treat a null byte as the end-of-string marker. Silently truncating a string is bad practice and can lead to bugs, including security vulnerabilities.
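To make the truncation concrete, here is a minimal sketch in plain C. The buffer only stands in for the UTF-8 data that PyUnicode_AsUTF8() would return for the Python string "abc\0def". The names visible_length, sample_utf8 and sample_size are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the length that a char*-based C function would see.
   Such functions stop at the first null byte, whatever the real size is. */
static size_t visible_length(const char *utf8)
{
    assert(utf8 != NULL);
    return strlen(utf8);
}

/* Stand-in for the UTF-8 encoding of the Python string "abc\0def":
   7 bytes of data, followed by the usual terminating null byte. */
static const char sample_utf8[] = "abc\0def";
static const size_t sample_size = sizeof(sample_utf8) - 1;
```

Here sample_size is 7, but visible_length(sample_utf8) is 3: everything after the embedded null byte is invisible to code that only receives the char*.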
In practice, the minority of impacted C extensions and users should benefit from this backward-incompatible change, since silently truncating a string is bad practice. Users who want to truncate on purpose can call PyUnicode_AsUTF8AndSize(obj, NULL)
and simply ignore the size.
It would address the following "hidden" comment on PyUnicode_AsUTF8():
Use of this API is DEPRECATED since no size information can be
extracted from the returned data.
PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.
In the recently added PyUnicode_EqualToUTF8(obj, str), obj is treated as not equal to str if obj contains embedded null characters.
The following functions already raise an exception if the string contains embedded null characters or bytes:
- PyUnicode_AsWideCharString()
- PyUnicode_EncodeLocale()
- PyUnicode_EncodeFSDefault()
- PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
- PyUnicode_DecodeFSDefaultAndSize()
- PyUnicode_FSConverter()
- PyUnicode_FSDecoder()
PyUnicode_AsUTF8String() returns a bytes object, which carries its own length, so it doesn't raise the exception.
PyUnicode_AsUTF8AndSize() also returns the size, so it doesn't raise on embedded null characters either.
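Because the size is available, a caller of PyUnicode_AsUTF8AndSize() can detect embedded null characters itself. The following is a minimal sketch in plain C, not code from CPython: utf8_has_embedded_nul is a hypothetical helper name, and the call pattern in the trailing comment is only one possible way to report the error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
   Returns 1 if the first `size` bytes of `utf8` contain a null byte,
   0 otherwise. */
static int utf8_has_embedded_nul(const char *utf8, size_t size)
{
    assert(utf8 != NULL);
    return memchr(utf8, '\0', size) != NULL;
}

/* Possible call pattern (requires Python.h, shown as a comment only):
 *
 *     Py_ssize_t size;
 *     const char *utf8 = PyUnicode_AsUTF8AndSize(obj, &size);
 *     if (utf8 == NULL) {
 *         return NULL;   // exception already set
 *     }
 *     if (utf8_has_embedded_nul(utf8, (size_t)size)) {
 *         PyErr_SetString(PyExc_ValueError, "embedded null character");
 *         return NULL;
 *     }
 */
```

The sketch uses memchr() rather than comparing strlen() with the size, so it depends only on the reported size and not on how the buffer is terminated.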
Linked PRs
- gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL #111091
- gh-111089: PyUnicode_AsUTF8AndSize() sets size on error #111106
- gh-111089: Add PyUnicode_AsUTF8() to the limited C API #111121
- gh-111089: Use PyUnicode_AsUTF8() in sqlite3 #111122
- gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic #111585
- gh-111089: Add cache to PyUnicode_AsUTF8() for embedded NUL #111587
- gh-111089: Use PyUnicode_AsUTF8() in getargs.c #111620
- gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672
- gh-111089: Add PyUnicode_AsUTF8NoNUL() function #111688
- gh-111089: Revert PyUnicode_AsUTF8() changes #111833