email.message get_payload throws UnicodeEncodeError with some surrogate Unicode characters #94606

Closed
@sidney

Description

email.message get_payload raises a UnicodeEncodeError if the message body contains a line that has either:
a Unicode surrogate code point that is valid for surrogateescape encoding (U+DC80 through U+DCFF) together with a non-ASCII character
OR
a Unicode surrogate code point that is not valid for surrogateescape encoding

Here is a minimal code example, with the second case commented out:

from email import message_from_string
from email.message import EmailMessage

m = message_from_string("surrogate char \udcc3 and 8-bit utf-8 ë on same line")
# m = message_from_string("surrogate char \udfff does it by itself")
payload = m.get_payload(decode=True)

With Python 3.10.5 on macOS this produces:

Traceback (most recent call last):
  File "/Users/sidney/tmp/./test5.py", line 8, in <module>
    payload = m.get_payload(decode=True)
  File "/usr/local/Cellar/python@3.10/3.10.5/Frameworks/Python.framework/Versions/3.10/lib/python3.10/email/message.py", line 264, in get_payload
    bpayload = payload.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode character '\xeb' in position 33: ordinal not in range(128)

This was tested on Python 3.10.5 on macOS. However, I tracked it down from a report in the wild, where Python 3.8 on Ubuntu 20.04 was processing actual emails.
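To show that the failure comes from the encode call in the traceback rather than from message parsing, here is a minimal sketch that runs the same `str.encode('ascii', 'surrogateescape')` call directly on three strings. The helper name `try_surrogateescape` is hypothetical and only for illustration; it is not part of the email package.

```python
def try_surrogateescape(text):
    """Return the encoded bytes, or None if the encode raises.

    Hypothetical helper: it repeats the call shown in the traceback,
    payload.encode('ascii', 'surrogateescape'), outside of email.message.
    """
    try:
        return text.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        return None


# A surrogate in U+DC80..U+DCFF with otherwise ASCII text maps back to a raw byte.
ok = try_surrogateescape("surrogate char \udcc3 alone")

# The same surrogate plus a non-ASCII character ("\xeb" is the letter e with diaeresis)
# fails, because the error handler only knows how to convert escaped surrogates.
mixed = try_surrogateescape("surrogate char \udcc3 and 8-bit utf-8 \xeb on same line")

# A surrogate outside U+DC80..U+DCFF fails even with no other non-ASCII text.
high = try_surrogateescape("surrogate char \udfff does it by itself")

print(ok)
print(mixed)
print(high)
```

Only the first string encodes. The other two correspond to the two cases described above, which suggests that get_payload assumes any string containing a surrogate holds nothing but ASCII and surrogateescape-range code points.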
