Description
I just modified PyUnicode_AsUTF8() in the C API to raise an exception if a string contains an embedded null character, to reduce the risk of security vulnerabilities. Callers of PyUnicode_AsUTF8() expect a string terminated by a null byte. If the UTF-8 encoded string contains an embedded null byte, the caller is likely to truncate the string without knowing that there are more bytes after the first null byte.
See: https://owasp.org/www-community/attacks/Embedding_Null_Code
This is not only a security issue; it can also simply be seen as a bug: unwanted behavior.
Previous issues:
- [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089
- _winapi.LCMapStringEx fails when encountering a string containing null characters #106844
- os.path.normpath truncates input on null bytes in 3.11, but not 3.10 #106242 -- CVE-2023-41105
- Uncaught exception in http.server request handling (<=3.10) #103223
- embedded null byte when connecting to sqlite database using a bytes object #84335
- os.path.exists should not throw "Embedded NUL character" exception #73228
- "embedded NUL character" exceptions #66411
- sqlite3 doesn't complain if the request contains a null character #65346
- Reject embedded null characters in wchar* strings #57826
Discussions:
- 2014: https://mail.python.org/archives/list/python-dev@python.org/thread/MZDL7FZZMRSW5MTIHLSA6ANNMCV7EEZN/
Example with Python 3.12:
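import ctypes

# Linux-only: load the C library to call printf() directly.
libc = ctypes.cdll.LoadLibrary('libc.so.6')
printf = libc.printf

# Expose PyUnicode_AsUTF8() as a function returning a C string (char *).
PyUnicode_AsUTF8 = ctypes.pythonapi.PyUnicode_AsUTF8
PyUnicode_AsUTF8.argtypes = (ctypes.py_object,)
PyUnicode_AsUTF8.restype = ctypes.c_char_p

# The string contains an embedded null character right after "World".
my_string = "World\0truncated string"
printf(b"Hello %s\n", PyUnicode_AsUTF8(my_string))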
Output:
Hello World
The "truncated string" part is silently ignored!
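The truncation can also be illustrated without ctypes. The following is a minimal pure-Python sketch, not the actual C implementation: `as_c_string()` and `encode_checked()` are hypothetical helper names introduced here. The first simulates what C code sees when it treats the UTF-8 encoding as a null-terminated string, and the second shows the kind of check that rejects such input instead.

```python
def as_c_string(text: str) -> bytes:
    """Simulate what C code sees: the bytes up to the first null byte."""
    encoded = text.encode("utf-8")
    return encoded.split(b"\0", 1)[0]


def encode_checked(text: str) -> bytes:
    """Encode to UTF-8, rejecting embedded null characters.

    Hypothetical helper: it only mirrors the idea of the check,
    it is not how CPython implements it.
    """
    encoded = text.encode("utf-8")
    if b"\0" in encoded:
        raise ValueError("embedded null character")
    return encoded


my_string = "World\0truncated string"
print(as_c_string(my_string))        # b'World'
try:
    encode_checked(my_string)
except ValueError as exc:
    print("rejected:", exc)          # rejected: embedded null character
```

In UTF-8, only U+0000 is encoded as a zero byte, so checking the encoded bytes for `b"\0"` is equivalent to checking the string for a null character.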
Multiple functions were modified in the past to prevent this problem. Examples:
- _dbm.open(): check filename
- _gdbm.open(): check filename
- PyBytes_AsStringAndSize(str, NULL)
- grp.getgrnam(): check name
- pwd.getpwnam(): check name
- _locale.strxfrm(): check argument
- path_converter() of the os module: basically any filename and path
- PyUnicode_AsWideCharString()
- os.putenv()
- _posixsubprocess.fork_exec(): executable_list
- _struct.Struct: check format
- _tkinter SetVar() and varname_converter()
- _winapi.CreateProcess() getenvironment()
- PyUnicode_EncodeLocale()
- PyUnicode_EncodeFSDefault()
- unicode_decode_locale()
- PyUnicode_FSConverter()
- PyUnicode_DecodeLocale()
- PyUnicode_DecodeLocaleAndSize()
- PyUnicode_FSDecoder()
- PyUnicode_AsUTF8() -- recently modified
- _Py_stat(): check path
- getargs.c: 's', 'y' and 'z' formats
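Some of these checks can be observed from Python code. The sketch below is an illustration rather than an exhaustive test: `rejects_embedded_null()` is a hypothetical helper introduced here, and it assumes the checks surface as `ValueError`, which is what path arguments handled by the os module raise.

```python
import os


def rejects_embedded_null(func, *args) -> bool:
    """Return True if func(*args) raises ValueError.

    Hypothetical helper: any ValueError is counted as a rejection;
    OSError (for example a missing file) is not.
    """
    try:
        func(*args)
    except ValueError:
        return True
    except OSError:
        return False
    return False


# Path argument handled by path_converter() of the os module.
print(rejects_embedded_null(os.stat, "file\0name"))
# Environment variable value passed to os.putenv().
print(rejects_embedded_null(os.putenv, "GR_DEMO_VAR", "a\0b"))
# Control case: a path without a null character is not rejected this way.
print(rejects_embedded_null(os.stat, "no-such-file-for-demo"))
```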
There are exceptions which accept embedded null bytes/characters:
- socket: AF_UNIX socket name
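For the AF_UNIX case, one legitimate use of a null byte is the Linux abstract socket namespace, where the address starts with a null byte. Whether this is the exact case meant above is an assumption. The sketch below is Linux-only, and `bind_abstract_socket()` is a hypothetical helper name.

```python
import os
import socket
import sys


def bind_abstract_socket(name: bytes) -> socket.socket:
    """Bind a Unix socket in the Linux abstract namespace.

    The leading null byte selects the abstract namespace, so this
    address must not be rejected as an "embedded null byte".
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(b"\0" + name)
    return sock


if sys.platform == "linux":
    # The process ID keeps the demo name unlikely to collide.
    demo_name = b"gr-demo-%d" % os.getpid()
    sock = bind_abstract_socket(demo_name)
    print(sock.getsockname())
    sock.close()
```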