
Commit 8c8016a

advancedxy authored and HyukjinKwon committed
[SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2
### What changes were proposed in this pull request?

This PR allows a non-ascii string as an exception message in Python 2 by explicitly encoding/decoding in the case of `str` in Python 2.

### Why are the changes needed?

Previously PySpark would hang when a `UnicodeDecodeError` occurred, because the real exception could not be passed to the JVM side. See the reproducer below:

```python
def f():
    raise Exception("中")

spark = SparkSession.builder.master('local').getOrCreate()
spark.sparkContext.parallelize([1]).map(lambda x: f()).count()
```

### Does this PR introduce any user-facing change?

Users should no longer observe hanging in such cases.

### How was this patch tested?

Added a new test and checked manually.

This PR is based on #18324; credit should also go to dataknocker.

To make lint-python happy for python3, it also includes a followup fix for #25814

Closes #25847 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
1 parent 4080c4b commit 8c8016a

3 files changed, 26 insertions(+), 1 deletion(-)

python/pyspark/sql/utils.py (3 additions, 0 deletions)

```diff
@@ -18,6 +18,9 @@
 import py4j
 import sys
 
+if sys.version_info.major >= 3:
+    unicode = str
+
 
 class CapturedException(Exception):
     def __init__(self, desc, stackTrace, cause=None):
```
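Not part of the commit itself, but a minimal standalone sketch of what the `unicode = str` alias above accomplishes: on Python 3 the builtin name `unicode` no longer exists, so code written for Python 2 that calls `unicode(...)` would raise `NameError` without this shim.

```python
import sys

# On Python 3, alias the removed Python 2 builtin `unicode` to `str`
# so code calling unicode(...) keeps working on both major versions.
if sys.version_info.major >= 3:
    unicode = str

# With the alias in place, unicode(...) behaves like str(...):
assert unicode("中") == "中"
assert unicode(42) == "42"
```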

python/pyspark/tests/test_worker.py (15 additions, 0 deletions)

```diff
@@ -1,3 +1,4 @@
+# -*- encoding: utf-8 -*-
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
@@ -150,6 +151,20 @@ def test_with_different_versions_of_python(self):
         finally:
             self.sc.pythonVer = version
 
+    def test_python_exception_non_hanging(self):
+        # SPARK-21045: exceptions with no ascii encoding shall not hanging PySpark.
+        try:
+            def f():
+                raise Exception("exception with 中 and \xd6\xd0")
+
+            self.sc.parallelize([1]).map(lambda x: f()).count()
+        except Py4JJavaError as e:
+            if sys.version_info.major < 3:
+                # we have to use unicode here to avoid UnicodeDecodeError
+                self.assertRegexpMatches(unicode(e).encode("utf-8"), "exception with 中")
+            else:
+                self.assertRegexpMatches(str(e), "exception with 中")
+
 
 class WorkerReuseTest(PySparkTestCase):
```

python/pyspark/worker.py (8 additions, 1 deletion)

```diff
@@ -598,8 +598,15 @@ def process():
         process()
     except Exception:
         try:
+            exc_info = traceback.format_exc()
+            if isinstance(exc_info, bytes):
+                # exc_info may contains other encoding bytes, replace the invalid bytes and convert
+                # it back to utf-8 again
+                exc_info = exc_info.decode("utf-8", "replace").encode("utf-8")
+            else:
+                exc_info = exc_info.encode("utf-8")
             write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
-            write_with_length(traceback.format_exc().encode("utf-8"), outfile)
+            write_with_length(exc_info, outfile)
         except IOError:
             # JVM close the socket
             pass
```
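The key idea in the worker.py change is sanitizing the traceback bytes with the `"replace"` error handler before shipping them to the JVM, so invalid byte sequences become U+FFFD instead of raising `UnicodeDecodeError`. Below is a minimal standalone sketch of that idea (the function name `sanitize_exc_bytes` is illustrative, not from the commit):

```python
def sanitize_exc_bytes(exc_info):
    """Return UTF-8 bytes that are always safe to write over the wire."""
    if isinstance(exc_info, bytes):
        # Bytes that are not valid UTF-8 (e.g. GBK-encoded text) are
        # replaced with U+FFFD instead of raising UnicodeDecodeError.
        return exc_info.decode("utf-8", "replace").encode("utf-8")
    return exc_info.encode("utf-8")

# "\xd6\xd0" is "中" in GBK, but is not valid UTF-8; both bytes are
# replaced with the U+FFFD replacement character.
safe = sanitize_exc_bytes(b"exception with \xd6\xd0")
assert safe.decode("utf-8") == "exception with \ufffd\ufffd"

# Already-decoded text passes through as plain UTF-8 bytes.
assert sanitize_exc_bytes("ok") == b"ok"
```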
