[SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors #45377


Closed
wants to merge 24 commits

Conversation

itholic
Contributor

@itholic itholic commented Mar 5, 2024

What changes were proposed in this pull request?

This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location in the user's PySpark code where the error occurred.

It directly adds the PySpark user call site information to the DataFrameQueryContext introduced in #43334, aiming to provide PySpark users with the same level of detailed error context for better usability and debugging efficiency with the DataFrame APIs.

This PR also introduces QueryContext.pysparkCallSite and QueryContext.pysparkFragment to retrieve the PySpark-specific information from the query context easily.

This PR also enhances check_error so that it can verify the query context when one exists.
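
For illustration, a minimal sketch of reading the new fields from a captured exception (assuming `getQueryContext()` exposes the contexts on the Python side, mirroring the JVM API; this is not the exact code from this PR, and the printed path is a made-up example):

```python
from pyspark.errors.exceptions.captured import CapturedException

spark.conf.set("spark.sql.ansi.enabled", True)
df = spark.range(10)

try:
    df.select(df.id / 0).collect()
except CapturedException as e:
    for ctx in e.getQueryContext():
        # pysparkFragment: the DataFrame API name, e.g. "divide"
        # pysparkCallSite: the user call site, e.g. "/path/to/app.py:7" (hypothetical)
        print(ctx.pysparkFragment(), ctx.pysparkCallSite())
```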

Why are the changes needed?

To improve debuggability. Errors originating from PySpark operations can be difficult to debug with the limited context in the error messages. While improvements on the JVM side have been made to offer detailed error contexts, PySpark errors often lack this level of detail.

Does this PR introduce any user-facing change?

No API changes, but error messages will include a reference to the exact line of user code that triggered the error, in addition to the existing descriptive error message.

For example, consider the following PySpark code snippet that triggers a DIVIDE_BY_ZERO error:

1  spark.conf.set("spark.sql.ansi.enabled", True)
2  
3  df = spark.range(10)
4  df.select(df.id / 0).show()

Before:

pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

After:

pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
/.../spark/python/test_pyspark_error.py:4

Now the error message points to the exact problematic code path, with the file name and line number from the user's code.

Points to the actual problem site instead of the site where the action was called

Even when an action is called after multiple mixed transform operations, the exact problematic site can be reported to the user:

In:

  1 spark.conf.set("spark.sql.ansi.enabled", True)
  2 df = spark.range(10)
  3
  4 df1 = df.withColumn("div_ten", df.id / 10)
  5 df2 = df1.withColumn("plus_four", df.id + 4)
  6
  7 # This is the problematic divide operation that causes DIVIDE_BY_ZERO.
  8 df3 = df2.withColumn("div_zero", df.id / 0)
  9 df4 = df3.withColumn("minus_five", df.id / 5)
 10
 11 df4.collect()

Out:

pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
/.../spark/python/test_pyspark_error.py:8

How was this patch tested?

Added UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@itholic
Contributor Author

itholic commented Mar 5, 2024

cc @HyukjinKwon I'm still working on Spark Connect support and unit tests, but the basic structure is ready for review.

FYI, also cc @MaxGekk as you made a similar contribution on the JVM side.

@itholic itholic changed the title [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [WIP][SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors Mar 28, 2024
@itholic itholic force-pushed the error_context_for_dataframe_api branch from 24fd3a0 to 2c1d5d8 on March 28, 2024 15:42
@itholic itholic changed the title [WIP][SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors Mar 28, 2024
@itholic itholic marked this pull request as ready for review March 28, 2024 15:42
@itholic
Contributor Author

itholic commented Mar 28, 2024

Hi, @HyukjinKwon @MaxGekk could you take a look at the prototype when you have some time?

@itholic itholic changed the title [SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors Apr 1, 2024
@itholic
Contributor Author

itholic commented Apr 3, 2024

Added QueryContext testing for DataFrameQueryContext and UTs. The CI failures seem unrelated. cc @HyukjinKwon FYI

@HyukjinKwon
Member

cc @cloud-fan too

exception=pe.exception,
error_class="INVALID_IDENTIFIER",
message_parameters={"ident": "non-existing-table"},
query_context_type=None,
Contributor Author

@itholic itholic Apr 3, 2024


FYI: None is the default, so we don't need to specify it like this when no QueryContext exists, but I wrote this test as an explicit example.
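
For comparison, a hedged sketch of the opposite case, where a test asserts that a DataFrame query context is attached to the error (the DIVIDE_BY_ZERO message parameters and the `QueryContextType.DataFrame` value are illustrative, not copied from this PR's tests):

```python
with self.assertRaises(ArithmeticException) as pe:
    df.select(df.id / 0).collect()

self.check_error(
    exception=pe.exception,
    error_class="DIVIDE_BY_ZERO",
    message_parameters={"config": '"spark.sql.ansi.enabled"'},
    # Non-None: assert that a DataFrame query context is attached.
    query_context_type=QueryContextType.DataFrame,
)
```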

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 86ae0d2 Apr 11, 2024
@itholic
Contributor Author

itholic commented Apr 11, 2024

Thanks @cloud-fan @ueshin @HyukjinKwon @xinrong-meng for the review!

@HyukjinKwon
Member

Let's clarify why #45377 (comment) happens before we move further. That shouldn't happen from my understanding.

If we go with the current approach, it would need more changes, e.g., Column.substr, because it takes a different set of arguments and types.

* operation originated.
* @return A Column resulting from the operation.
*/
private def fn(
Contributor

@cloud-fan cloud-fan Apr 15, 2024


@HyukjinKwon This probably can't cover all the cases, and we may need to add more overloads for certain functions that require non-expression parameters, but it shouldn't be many.

I think it's better than using a ThreadLocal, which can be quite fragile for passing values between Python and the JVM.

@itholic
Copy link
Contributor Author

itholic commented Apr 15, 2024

The difficulty with the previous method was that it was not easy to perfectly sync the data between two separately operating ThreadLocals, CurrentOrigin and PySparkCurrentOrigin.

After taking a deeper look at the structure, I think we may be able to make CurrentOrigin flexible enough to support the PySpark error context, instead of adding a separate ThreadLocal like PySparkCurrentOrigin.

If it works, it seems possible to make the structure more flexible while maintaining the existing communication rules between Python and the JVM, without adding helper functions such as a PySpark-specific fn.

Let me give it a try, create a PR to refactor the current structure, and ping you guys.
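
For context, a minimal Python sketch of the kind of thread-local call-site capture under discussion (the names here are illustrative, not the actual implementation):

```python
import functools
import inspect
import threading

# Illustrative stand-in for a PySpark-side thread local such as
# PySparkCurrentOrigin; the JVM keeps its own CurrentOrigin, and keeping
# the two in sync is the fragile part discussed above.
_origin = threading.local()

def _with_origin(func):
    """Record the user call site before delegating to the JVM."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        caller = inspect.stack()[1]  # the frame of the user's code
        _origin.fragment = func.__name__
        _origin.call_site = f"{caller.filename}:{caller.lineno}"
        return func(*args, **kwargs)
    return wrapper
```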

@itholic
Contributor Author

itholic commented Apr 15, 2024

Let me give it a try, create a PR to refactor the current structure, and ping you guys.

Created #46063.

@HyukjinKwon
Member

perfectly sync the data between two separately operating ThreadLocals, CurrentOrigin and PySparkCurrentOrigin.

Why is that?

@itholic
Contributor Author

itholic commented Apr 16, 2024

Because I called PySparkCurrentOrigin directly in DataFrameQueryContext without utilizing withOrigin in the initial implementation. I realized this from a recent review on the refactoring PR, so I'm currently trying to reintroduce PySparkCurrentOrigin there.

HyukjinKwon added a commit that referenced this pull request Jun 18, 2024
…rk Connect

### What changes were proposed in this pull request?

This PR proposes to implement DataFrameQueryContext in Spark Connect.

1.  Add two new protobuf messages packed together with `Expression`:

    ```proto
    message Origin {
      // (Required) Indicate the origin type.
      oneof function {
        PythonOrigin python_origin = 1;
      }
    }

    message PythonOrigin {
      // (Required) Name of the origin, for example, the name of the function
      string fragment = 1;

      // (Required) Callsite to show to end users, for example, stacktrace.
      string call_site = 2;
    }
    ```

2. Merge `DataFrameQueryContext.pysparkFragment` and `DataFrameQueryContext.pysparkCallSite` into the existing `DataFrameQueryContext.fragment` and `DataFrameQueryContext.callSite`

3. Separate `QueryContext` into `SQLQueryContext` and `DataFrameQueryContext` for consistency with the Scala side

4. Implement the origin logic. The `current_origin` thread local holds the current call site and function name, and `Expression` reads it from there.
    They are set on individual expression messages and are used when analysis happens; this resembles the Spark SQL implementation.

See also #45377.

### Why are the changes needed?

See #45377

### Does this PR introduce _any_ user-facing change?

Yes, same as #45377 but in Spark Connect.

### How was this patch tested?

The same unit tests are reused in Spark Connect.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46789 from HyukjinKwon/connect-context.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jun 24, 2024
### What changes were proposed in this pull request?

This PR proposes to display correct call site information from IPython Notebook.

### Why are the changes needed?

We added `DataFrameQueryContext` for PySpark error messages in #45377, but it does not work very well from IPython Notebook.

### Does this PR introduce _any_ user-facing change?

No API changes, but the user-facing error message from IPython Notebook will be improved:

**Before**
<img width="1124" alt="Screenshot 2024-06-18 at 5 15 56 PM" src="https://github.com/apache/spark/assets/44108233/3e3aee2c-5bb0-4858-b392-e845b7280d31">

**After**
<img width="1163" alt="Screenshot 2024-06-19 at 8 45 05 AM" src="https://github.com/apache/spark/assets/44108233/81741d15-cac9-41e7-815a-5d84f1176c73">

**NOTE:** This also works when a command is executed across multiple cells:

<img width="1175" alt="Screenshot 2024-06-19 at 8 42 29 AM" src="https://github.com/apache/spark/assets/44108233/d65fbf79-d621-4ae0-b220-2f7923cc3666">

### How was this patch tested?

Manually tested with IPython Notebook.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47009 from itholic/error_context_on_notebook.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Member

@gengliangwang gengliangwang left a comment


Late LGTM!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024