[SPARK-33134][SQL] Return partial results only for root JSON objects #30031

MaxGekk · 2020-10-13T13:19:59Z

What changes were proposed in this pull request?

In this PR, I propose to restrict the partial-result feature to root JSON objects only. The JSON datasource, as well as from_json(), will return null for malformed nested JSON objects.

Why are the changes needed?

To avoid raising exceptions to users in the PERMISSIVE mode.
To fix a regression and restore the behavior of Spark 2.4.x.
The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for malformed nested complex JSON fields.

Does this PR introduce any user-facing change?

Yes. Before the changes, the code below:

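    // Imports for from_json and the schema types; toDF and $ also need
    // spark.implicits._, which is already in scope in spark-shell.
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
    val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
    val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
    pokerhand_events.show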

throws an exception even in the default PERMISSIVE mode:

java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)

After the changes:

+-----+
|event|
+-----+
| null|
+-----+
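
The root-only rule described above can be illustrated with a small, self-contained toy model. This is a sketch under stated assumptions, not Spark's parser: the names `RootOnlySketch`, `Schema`, `PartialResult`, `convert` and `parsePermissive` are all hypothetical, and JSON values are modelled as plain Scala `Long`, `String`, `List` and `Map` values.

```scala
import scala.util.control.NonFatal

// Toy model only: every name here is hypothetical, none of it is Spark code.
object RootOnlySketch {
  sealed trait Schema
  case object LongT extends Schema
  case object StringT extends Schema
  final case class ArrayT(element: Schema) extends Schema
  final case class StructT(fields: List[(String, Schema)]) extends Schema

  // Carries the partially filled row together with the first conversion error.
  final case class PartialResult(partial: Map[String, Any], cause: Throwable)
    extends RuntimeException(cause)

  def convert(value: Any, schema: Schema, isRoot: Boolean = false): Any =
    (value, schema) match {
      case (n: Long, LongT) => n
      case (s: String, StringT) => s
      // Array elements are never treated as root objects.
      case (xs: List[_], ArrayT(elem)) => xs.map(x => convert(x, elem))
      case (obj: Map[_, _], StructT(fields)) =>
        val input = obj.asInstanceOf[Map[String, Any]]
        var firstError: Option[Throwable] = None
        val row = fields.map { case (name, fieldType) =>
          val converted: Any =
            try { input.get(name).map(v => convert(v, fieldType)).orNull }
            catch {
              // Only a root object swallows a bad field and keeps going;
              // in a nested object the error propagates to the caller.
              case NonFatal(e) if isRoot =>
                firstError = firstError.orElse(Some(e))
                null
            }
          name -> converted
        }.toMap
        firstError.foreach(e => throw PartialResult(row, e))
        row
      case (other, expected) =>
        throw new IllegalArgumentException(s"cannot convert $other to $expected")
    }

  // PERMISSIVE-like wrapper: partial row if one is available, otherwise null.
  def parsePermissive(value: Any, schema: Schema): Any =
    try { convert(value, schema, isRoot = true) }
    catch {
      case PartialResult(partial, _) => partial
      case NonFatal(_) => null
    }
}
```

In this toy, feeding the poker-hand record as an element of a root array (`ArrayT(event)`) yields `null`, because the malformed `cards` field sits in a non-root object. Feeding the same object directly as the root yields a partial row with `cards` set to `null`.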

How was this patch tested?

Added a test to JsonFunctionsSuite.

MaxGekk · 2020-10-13T13:20:37Z

@HyukjinKwon Could you review this PR?

MaxGekk · 2020-10-13T16:49:54Z

The changes conflict with branch-3.0. Here is the backport to 3.0: #30032

HyukjinKwon · 2020-10-14T03:10:33Z

We probably need to redesign/refactor the JSON parsing logic here .. it's now pretty convoluted ..

HyukjinKwon · 2020-10-14T03:13:25Z

Merged to master.

…and JSON functions

### What changes were proposed in this pull request?

This PR is a follow-up for [SPARK-33134](https://issues.apache.org/jira/browse/SPARK-33134) (#30031).

I found another case when, depending on the order of columns, parsing one JSON field breaks all of the subsequent fields resulting in all nulls:

With a file like this:
```
{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}
```

Reading the file results in column `b` as null even though it is a valid column.
```scala
val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path")
===
a                 b
null              null
{"x":2,"y":null}  {"x":2}
```

However, b column should be:
```
{"x": 1}
{"x": 2}
```

This particular example actually used to work in earlier Spark versions but it was affected by SPARK-33134 which fixed another bug with the incorrect parsing in `from_json`. Because this case was not tested, we missed it at the time.

In order to fix both SPARK-33134 and SPARK-40646, we need to process `PartialResultException` in `convertArray` method to handle any errors in child objects. Without the fix, the code would not wrap the row in the array for `from_json` resulting in a ClassCastException (SPARK-33134). Because of this handling, we don't need `isRoot` check anymore in `convertObject` thus unblocking SPARK-40646.

I updated the code to handle both cases. With these changes, we can correctly parse this case:
```scala
val df3 = Seq("""[{"c2": [19], "c1": 123456}]""").toDF("c0")
checkAnswer(df3.select(from_json($"c0", ArrayType(st))), Row(Array(Row(123456, null))))
```
which was previously returning `null` for the root row.

### Why are the changes needed?

Fixes a long-standing issue when parsing a JSON with an incorrect field that would break parsing of the entire record.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests for SPARK-40646 as well as SPARK-33134.

Closes #38090 from sadikovi/SPARK-40646.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
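
The idea in the follow-up message (catch the partial-result exception in the array converter, and drop the root check in the object converter) can be sketched with a toy model. This is an illustration under assumptions, not the actual patch: `NestedPartialSketch` and everything inside it are hypothetical names, JSON is modelled as plain Scala values, and the choice to set a failed field to `null` at every level is a simplification made for this sketch.

```scala
import scala.util.control.NonFatal

// Toy model only: every name here is hypothetical, none of it is Spark code.
object NestedPartialSketch {
  sealed trait Schema
  case object LongT extends Schema
  case object StringT extends Schema
  final case class ArrayT(element: Schema) extends Schema
  final case class StructT(fields: List[(String, Schema)]) extends Schema

  // Stand-in for the PartialResultException named in the text above.
  final case class PartialResult(partial: Any, cause: Throwable)
    extends RuntimeException(cause)

  def convert(value: Any, schema: Schema): Any = (value, schema) match {
    case (n: Long, LongT) => n
    case (s: String, StringT) => s
    case (xs: List[_], ArrayT(elem)) => convertArray(xs, elem)
    case (obj: Map[_, _], StructT(fields)) =>
      convertObject(obj.asInstanceOf[Map[String, Any]], fields)
    case (other, expected) =>
      throw new IllegalArgumentException(s"cannot convert $other to $expected")
  }

  // No root check: any object nulls out a failed field and keeps the rest.
  def convertObject(input: Map[String, Any], fields: List[(String, Schema)]): Any = {
    var firstError: Option[Throwable] = None
    val row = fields.map { case (name, fieldType) =>
      val converted: Any =
        try { input.get(name).map(v => convert(v, fieldType)).orNull }
        catch {
          case NonFatal(e) =>
            firstError = firstError.orElse(Some(e))
            null
        }
      name -> converted
    }.toMap
    firstError.foreach(e => throw PartialResult(row, e))
    row
  }

  // Partial rows from elements are kept, so the result is still an array.
  def convertArray(xs: List[Any], elem: Schema): Any = {
    var firstError: Option[Throwable] = None
    val values = xs.map { x =>
      try { convert(x, elem) }
      catch {
        case PartialResult(partial, e) =>
          firstError = firstError.orElse(Some(e))
          partial
      }
    }
    firstError.foreach(e => throw PartialResult(values, e))
    values
  }

  // PERMISSIVE-like wrapper: partial value if available, otherwise null.
  def parsePermissive(value: Any, schema: Schema): Any =
    try { convert(value, schema) }
    catch {
      case PartialResult(partial, _) => partial
      case NonFatal(_) => null
    }
}
```

In this toy, the `[{"c2": [19], "c1": 123456}]` shape parses to a one-element list whose row keeps `c1` and has `c2` set to `null`, and a record with a malformed `a` field still returns its valid `b` field.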


MaxGekk added 4 commits October 13, 2020 15:59

Add a test

60799f9

Fix

12e9fb2

Improve test

ae9ce52

Simplify test

813127a

Test more cases

bccbb9b

MaxGekk added 2 commits October 13, 2020 22:15

Test refactoring

aaf185c

Fix test

241eca5

HyukjinKwon approved these changes Oct 14, 2020


HyukjinKwon closed this in 05a62dc Oct 14, 2020

MaxGekk deleted the json-skip-row-wrong-schema branch December 11, 2020 20:28

MaxGekk mentioned this pull request Apr 21, 2021

[SPARK-35094][SQL]Spark from_json(JsonToStruct) function return wrong value in permissive mode #32252

Closed

sadikovi mentioned this pull request Oct 4, 2022

[SPARK-40646][SQL] Fix returning partial results in JSON data source and JSON functions #38090

Closed
