[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed
@vstinner

I propose to change the PyUnicode_AsUTF8() API to raise an exception and return NULL if the string contains embedded null characters.

If the string contains an embedded null character, the UTF-8 encoded string is silently truncated when it is passed to C functions that take a char*, since they treat the first null byte as the end of the string. Truncating a string silently is bad practice and can lead to bugs, including security vulnerabilities.
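To make the truncation concrete, here is a minimal sketch in plain C with no Python headers. The helper names `has_embedded_null` and `visible_length` are hypothetical and exist only for illustration. It contrasts what a `char*` consumer sees with what the buffer actually holds.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a CPython API): returns 1 if the first `size`
 * bytes of `buf` contain a null byte, 0 otherwise. */
static int
has_embedded_null(const char *buf, size_t size)
{
    /* memchr() scans exactly `size` bytes; it does not stop early
     * the way strlen() does. */
    return memchr(buf, '\0', size) != NULL;
}

/* Hypothetical helper: number of bytes that a char*-based C function
 * (strlen, strcpy, printf("%s"), ...) would actually process. */
static size_t
visible_length(const char *buf)
{
    return strlen(buf);
}
```

For a 7-byte buffer holding `"abc\0def"`, `visible_length()` reports 3 while `has_embedded_null()` reports 1: the trailing `def` is invisible to any consumer that relies on the terminator.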

In practice, the minority of affected C extensions and users should benefit from this backward-incompatible change, since truncating a string silently is bad practice. Users who want to truncate on purpose can call PyUnicode_AsUTF8AndSize(obj, NULL) and use the result without any size information.

It would address the following "hidden" comment on PyUnicode_AsUTF8():

Use of this API is DEPRECATED since no size information can be
extracted from the returned data.

PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.

In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.

The following functions already raise an exception if the string contains embedded null characters or bytes:

  • PyUnicode_AsWideCharString()
  • PyUnicode_EncodeLocale()
  • PyUnicode_EncodeFSDefault()
  • PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
  • PyUnicode_DecodeFSDefaultAndSize()
  • PyUnicode_FSConverter()
  • PyUnicode_FSDecoder()

PyUnicode_AsUTF8String() returns a bytes object, which carries its own length, so it doesn't need to raise an exception.

PyUnicode_AsUTF8AndSize() also returns the size, so it doesn't raise on embedded null characters.
