Open
Labels
interpreter-core, stdlib, type-bug
Description
Bug report
Bug description:
For `codecs.encode` with a `utf-*` encoding and a custom `errors` handler that returns a `str`, passing characters that cannot be encoded as UTF (e.g. surrogates) just raises `UnicodeEncodeError`, instead of the expected (and documented) behavior where the returned `str` is appended to the output.
import codecs

ERRORS_NAME = "returning non-ascii"

# something that cannot be encoded via `utf-*`
BAD_UTF = "\uD800"  # the first high surrogate character

def register_repl_error(repl):
    def error_handle(exc: UnicodeEncodeError):
        # replace the offending span with `repl` and resume after it
        return (repl, exc.end)
    codecs.register_error(ERRORS_NAME, error_handle)

def encode_surrogate(encoding, repl):
    register_repl_error(repl)
    max_enc_len = 9
    try:
        res = codecs.encode(BAD_UTF, encoding, ERRORS_NAME)
    except UnicodeEncodeError as err:
        reason = err.reason
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} raises with {reason=}")
    else:
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} returns {res}")

## emoji
NON_ASCII = "\U0001F605"  # \N{smiling face with open mouth and cold sweat}: 😅
for i in ('8', '16', '32', '16-le', '16-be'):
    encode_surrogate("utf-" + i, NON_ASCII)

print('-'*3)

## cjk
NON_ASCII = "龍"  # loong in Chinese
### zh
for enc in ("gbk", "big5"):
    encode_surrogate(enc, NON_ASCII)
### jp
for enc in ("Shift_JIS", "EUC-JP"):
    encode_surrogate(enc, NON_ASCII)
Output:
codecs.encode('\ud800', encoding=utf-8    ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-32   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-le) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-be) with custom errors returning non-ascii raises with reason='surrogates not allowed'
---
codecs.encode('\ud800', encoding=gbk      ) with custom errors returning non-ascii returns b'\xfd\x88'
codecs.encode('\ud800', encoding=big5     ) with custom errors returning non-ascii returns b'\xc0s'
codecs.encode('\ud800', encoding=Shift_JIS) with custom errors returning non-ascii returns b'\x97\xb4'
codecs.encode('\ud800', encoding=EUC-JP   ) with custom errors returning non-ascii returns b'\xce\xb6'
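To narrow the failure down, the sketch below compares three kinds of replacement for `utf-8` only: an ASCII `str`, a non-ASCII `str`, and pre-encoded `bytes`. The handler names (`sketch-*`) and the helpers `make_handler` and `try_encode` are hypothetical and exist only for this illustration. On the versions listed in this report I would expect only the non-ASCII `str` case to raise, which suggests the UTF encoders accept a `str` replacement only when it is pure ASCII; returning already-encoded `bytes` may serve as a workaround, assuming the handler knows the target encoding.

```python
import codecs

def make_handler(repl):
    """Build an error handler that substitutes `repl` for the bad span."""
    def handler(exc):
        # Resume encoding right after the offending characters.
        return (repl, exc.end)
    return handler

# Hypothetical handler names, registered only for this sketch.
codecs.register_error("sketch-ascii-str", make_handler("?"))
codecs.register_error("sketch-non-ascii-str", make_handler("\U0001F605"))
codecs.register_error("sketch-bytes", make_handler("\U0001F605".encode("utf-8")))

BAD_UTF = "\uD800"  # lone high surrogate, not encodable as UTF-8

def try_encode(errors):
    """Return the encoded bytes, or the error reason if encoding raises."""
    try:
        return codecs.encode(BAD_UTF, "utf-8", errors)
    except UnicodeEncodeError as err:
        return err.reason

ascii_result = try_encode("sketch-ascii-str")
non_ascii_result = try_encode("sketch-non-ascii-str")
bytes_result = try_encode("sketch-bytes")

for name, result in [("ascii str", ascii_result),
                     ("non-ascii str", non_ascii_result),
                     ("bytes", bytes_result)]:
    print(f"{name:14}: {result!r}")
```

Note that the `bytes` replacement is inserted verbatim, so for `utf-16` or `utf-32` it would need to be encoded with the matching byte order and without a BOM.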
CPython versions tested on:
3.9, 3.11, 3.12, 3.13, 3.14
Operating systems tested on:
Linux, Windows