[SPARK-48863][SQL] Fix ClassCastException when parsing JSON with "spark.sql.json.enablePartialResults" enabled #47292

sadikovi · 2024-07-11T02:50:32Z

What changes were proposed in this pull request?

This PR fixes a bug in a corner case of JSON parsing when spark.sql.json.enablePartialResults is enabled.

When running the following query with the config set to true:

select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')

the code would fail with

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: 
Lost task 0.0 in stage 4.0 (TID 4) (ip-10-110-51-101.us-west-2.compute.internal executor driver): 
java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class 
org.apache.spark.sql.catalyst.util.ArrayData (org.apache.spark.unsafe.types.UTF8String and 
org.apache.spark.sql.catalyst.util.ArrayData are in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:53)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:53)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:172)
    at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:831)
    at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:893)

The patch fixes the issue by re-throwing PartialArrayDataResultException if parsing fails in this special case.
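
The re-throw described above can be pictured with a toy model. The sketch below is plain Python, not Spark's Scala code. The exception classes only borrow names from this description (PartialResultException is an assumed counterpart to the one named here), and convert_object, parse_root_as_array, from_json_array and the rethrow_as_array switch are hypothetical names chosen for illustration. It assumes, based on the stack trace, that the caller reads field 0 of a partial row as if it were an array.

```python
import json


class PartialResultException(Exception):
    """Toy stand-in: carries a partially parsed struct row."""

    def __init__(self, partial_row, cause):
        super().__init__(str(cause))
        self.partial_row = partial_row
        self.cause = cause


class PartialArrayDataResultException(Exception):
    """Toy stand-in: carries a partially parsed array of rows."""

    def __init__(self, partial_array, cause):
        super().__init__(str(cause))
        self.partial_array = partial_array
        self.cause = cause


def convert_object(obj, fields):
    """Convert a JSON object to a row; fields that fail to cast become None."""
    row, first_error = [], None
    for name, cast in fields:
        value = obj.get(name)
        try:
            row.append(None if value is None else cast(value))
        except (TypeError, ValueError) as err:
            row.append(None)
            if first_error is None:
                first_error = err
    if first_error is not None:
        raise PartialResultException(row, first_error)
    return row


def parse_root_as_array(text, fields, rethrow_as_array):
    """Parse an object at the root as a one-element array of structs."""
    value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError("this toy only models an object at the root")
    try:
        return [convert_object(value, fields)]
    except PartialResultException as err:
        if rethrow_as_array:
            # Patched behavior: wrap the partial row in an array and re-throw.
            raise PartialArrayDataResultException([err.partial_row], err.cause) from err
        raise  # Unpatched behavior: a bare row escapes to an array-typed caller.


def from_json_array(text, fields, rethrow_as_array=True):
    """Caller that expects an array, like the from_json call in the description."""
    try:
        return parse_root_as_array(text, fields, rethrow_as_array)
    except PartialArrayDataResultException as err:
        return err.partial_array
    except PartialResultException as err:
        # Assumed failure mode: field 0 of the row is read as an array.
        first = err.partial_row[0]
        if not isinstance(first, list):
            raise TypeError(f"{type(first).__name__} cannot be used as an array") from err
        return first
```

With fields a:string and c:int, the input from the query above yields a one-element array when rethrow_as_array is on, and the cast-style failure when it is off. The point of the design is that the exception type tells the caller what shape the partial result has, so an array-typed caller never receives a bare row.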

Why are the changes needed?

Fixes a bug that prevented users from reading JSON objects as arrays, a capability introduced in SPARK-19595. This is a special case, but it works with the flag off, so it should also work with the flag on.

Does this PR introduce any user-facing change?

Yes, but only as a bug fix: with partial results enabled, this query failed before the patch.
The parsing output depends on the flag, as a consequence of the partial results improvement:

With partial results disabled, the query returns null. With this patch and partial results enabled, it returns Array([b, null]). This difference is not specific to this patch but comes from the partial results feature in general.
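
To compare the two settings locally, a small PySpark helper along the following lines may be enough. This is a sketch under assumptions: run_with_flag, CONF_KEY and QUERY are hypothetical names, the "as parsed" alias is added here for convenience and is not part of the original query, and the helper assumes the flag can be changed on a running session through spark.conf.set.

```python
CONF_KEY = "spark.sql.json.enablePartialResults"

# Query from the description, with an alias added so the column can be read by name.
QUERY = (
    "select from_json('{\"a\":\"b\",\"c\":\"d\"}', "
    "'array<struct<a:string, c:int>>') as parsed"
)


def run_with_flag(spark, enabled):
    """Set the partial-results flag, run the query, and return the parsed value."""
    spark.conf.set(CONF_KEY, "true" if enabled else "false")
    return spark.sql(QUERY).first()["parsed"]


# Possible usage, assuming pyspark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[1]").getOrCreate()
#   print(run_with_flag(spark, False))
#   print(run_with_flag(spark, True))
```

According to the description, the first call should give null and the second a one-element array on a build that includes this patch. On a build without it, the second call is expected to fail with the ClassCastException shown earlier.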

How was this patch tested?

I added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

sadikovi · 2024-07-11T02:51:59Z

cc @HyukjinKwon @cloud-fan @dongjoon-hyun

HyukjinKwon · 2024-07-11T06:02:43Z

Merged to master and branch-3.5.

Closes #47292 from sadikovi/SPARK-48863.
Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 31d5ea1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

Closes apache#47292 from sadikovi/SPARK-48863.
Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

sadikovi added 2 commits July 11, 2024 14:43

update (bf50449)

add comments (13e2281)

github-actions bot added the SQL label Jul 11, 2024

HyukjinKwon approved these changes Jul 11, 2024

cloud-fan approved these changes Jul 11, 2024

yaooqinn approved these changes Jul 11, 2024

HyukjinKwon closed this in 31d5ea1 Jul 11, 2024
