
codecs.encode with utf-* encoding and errors returning str rejects surrogates blindly #127305

@litlighilit


Bug report

Bug description:

For codecs.encode with a utf-* encoding and a custom errors handler that returns a str,
passing characters that are not valid in UTF encodings (e.g. surrogates)
just raises UnicodeEncodeError, instead of following the expected (and documented) behavior
where the returned str is encoded and appended to the output.

import codecs

ERRORS_NAME = "returning non-ascii"

# something that cannot be encoded via `utf-*`
BAD_UTF = "\uD800"  # the first high surrogate character

def register_repl_error(repl):
    def error_handle(exc: UnicodeEncodeError):
        return (repl, exc.end)

    codecs.register_error(ERRORS_NAME, error_handle)



def encode_surrogate(encoding, repl):
    register_repl_error(repl)
    max_enc_len = 9
    try:
        res = codecs.encode(BAD_UTF, encoding, ERRORS_NAME)
    except UnicodeEncodeError as err:
        reason = err.reason
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} raises with {reason=}")

    else:
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} returns {res}")


## emoji

NON_ASCII = "\U0001F605" # \N{smiling face with open mouth and cold sweat}: 😅

for i in ('8', '16', '32', '16-le', '16-be'):
    encode_surrogate("utf-" + i, NON_ASCII)


print('-'*3)


## cjk

NON_ASCII = "龍"  # loong in Chinese


### zh
for enc in ("gbk", "big5"):
    encode_surrogate(enc, NON_ASCII)

### jp
for enc in ("Shift_JIS", "EUC-JP"):
    encode_surrogate(enc, NON_ASCII)

Output:

codecs.encode('\ud800', encoding=utf-8    ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-32   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-le) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-be) with custom errors returning non-ascii raises with reason='surrogates not allowed'
---
codecs.encode('\ud800', encoding=gbk      ) with custom errors returning non-ascii returns b'\xfd\x88'
codecs.encode('\ud800', encoding=big5     ) with custom errors returning non-ascii returns b'\xc0s'
codecs.encode('\ud800', encoding=Shift_JIS) with custom errors returning non-ascii returns b'\x97\xb4'
codecs.encode('\ud800', encoding=EUC-JP   ) with custom errors returning non-ascii returns b'\xce\xb6'
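
To narrow the report, the following is a minimal sketch comparing three kinds of replacement for the same lone surrogate under utf-8: an ASCII str, a non-ASCII str, and pre-encoded bytes. The handler names (`demo-ascii-str`, `demo-nonascii-str`, `demo-bytes`) and the helpers `make_handler` and `try_encode` are hypothetical and exist only for this comparison. On the versions listed below, only the non-ASCII str case is expected to raise, which suggests the rejection depends on the replacement being non-ASCII rather than on it being a str.

```python
import codecs

BAD_UTF = "\ud800"  # lone high surrogate, as in the report


def make_handler(repl):
    """Build an error handler that substitutes `repl` for the bad span."""
    def handler(exc):
        return (repl, exc.end)
    return handler


# Hypothetical handler names, registered only for this comparison.
codecs.register_error("demo-ascii-str", make_handler("?"))
codecs.register_error("demo-nonascii-str", make_handler("\U0001F605"))
codecs.register_error("demo-bytes", make_handler("\U0001F605".encode("utf-8")))


def try_encode(errors, encoding="utf-8"):
    """Return the encoded bytes, or a description of the failure."""
    try:
        return codecs.encode(BAD_UTF, encoding, errors)
    except UnicodeEncodeError as err:
        return f"UnicodeEncodeError: {err.reason}"


results = {
    name: try_encode(name)
    for name in ("demo-ascii-str", "demo-nonascii-str", "demo-bytes")
}
for name, outcome in results.items():
    print(f"{name:18} -> {outcome!r}")
```

Returning bytes from the handler is a possible workaround for utf-8, since bytes are copied to the output as-is. For utf-16 and utf-32 the bytes would need to be encoded with the matching byte order and without a BOM, which this sketch does not attempt.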

CPython versions tested on:

3.9, 3.11, 3.12, 3.13, 3.14

Operating systems tested on:

Linux, Windows


Labels:

interpreter-core (Objects, Python, Grammar, and Parser dirs), stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)
