Closed
Description
email.message get_payload raises a UnicodeEncodeError if the message body contains a line that has either:
a Unicode surrogate code point that is valid for surrogateescape encoding (U+DC80 through U+DCFF) and a non-ASCII UTF-8 character
OR
a Unicode surrogate character that is not valid for surrogateescape encoding
Here is a minimal code example with one of the cases commented out:
from email import message_from_string
from email.message import EmailMessage
m = message_from_string("surrogate char \udcc3 and 8-bit utf-8 ë on same line")
# m = message_from_string("surrogate char \udfff does it by itself")
payload = m.get_payload(decode=True)
On my Python 3.10.5 on macOS this produces:
Traceback (most recent call last):
  File "/Users/sidney/tmp/./test5.py", line 8, in <module>
    payload = m.get_payload(decode=True)
  File "/usr/local/Cellar/python@3.10/3.10.5/Frameworks/Python.framework/Versions/3.10/lib/python3.10/email/message.py", line 264, in get_payload
    bpayload = payload.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode character '\xeb' in position 33: ordinal not in range(128)
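The failing call in the traceback can be reproduced without the email package at all, which may help narrow things down. The sketch below applies the same `encode('ascii', 'surrogateescape')` call directly to three strings. The helper name `try_encode` is my own, purely for illustration, and is not part of the standard library.

```python
def try_encode(text):
    """Hypothetical helper: return the encoded bytes, or the exception on failure."""
    try:
        return text.encode("ascii", "surrogateescape")
    except UnicodeEncodeError as exc:
        return exc

# A surrogate in U+DC80..U+DCFF on its own is mapped back to the byte 0x80..0xFF.
ok = try_encode("escaped byte \udcc3 alone")

# The same surrogate plus a real non-ASCII character (U+00EB) on one line fails,
# because surrogateescape only handles the escaped-byte surrogates.
mixed = try_encode("surrogate char \udcc3 and 8-bit utf-8 \u00eb on same line")

# A surrogate outside U+DC80..U+DCFF fails by itself.
lone = try_encode("surrogate char \udfff does it by itself")

print(ok)           # b'escaped byte \xc3 alone'
print(mixed.start)  # 33, the position reported in the traceback
print(lone.start)   # 15, the position of the surrogate
```

So the error comes from the codec call itself: as soon as the string holds a surrogate together with anything surrogateescape cannot map back to a byte, the encode raises.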
This was tested on Python 3.10.5 on macOS. However, I tracked it down from a report in the wild, from a system running Python 3.8 on Ubuntu 20.04 and processing actual emails.
Linked PRs
- gh-94606: Fix error when message with Unicode surrogate not surrogateescaped string #94641
- [3.12] gh-94606: Fix error when message with Unicode surrogate not surrogateescaped string (GH-94641) #112971
- [3.11] gh-94606: Fix error when message with Unicode surrogate not surrogateescaped string (GH-94641) #112972