@MaxGekk MaxGekk commented Oct 13, 2020

What changes were proposed in this pull request?

In this PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as from_json() will return null for malformed nested JSON objects.

Why are the changes needed?

  1. To avoid raising exceptions to users in the PERMISSIVE mode.
  2. To fix a regression and restore the behavior of Spark 2.4.x.
  3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and is not tested for malformed nested complex JSON fields.
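
For readers unfamiliar with parse modes, the following is a minimal sketch of how PERMISSIVE differs from FAILFAST when `from_json` meets a malformed record. The object name, helper name, schema, and inputs are assumptions made for illustration and are not part of this PR; it assumes a local Spark installation on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StructType}
import scala.util.Try

object ParseModeSketch {
  // Hypothetical schema and inputs, chosen only for illustration.
  val schema: StructType = new StructType().add("playerId", LongType)
  // The second input is deliberately truncated, so it is not valid JSON.
  val inputs: Seq[String] = Seq("""{"playerId": 123456}""", """{"playerId": """)

  // Parses `inputs` with from_json under the given mode and reports
  // whether evaluating the query completed without an exception.
  def parsesWithoutError(spark: SparkSession, mode: String): Boolean = {
    import spark.implicits._
    val parsed = inputs.toDF("json")
      .select(from_json($"json", schema, Map("mode" -> mode)).as("parsed"))
    Try(parsed.collect()).isSuccess
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parse-mode-sketch")
      .getOrCreate()
    println(s"PERMISSIVE succeeds: ${parsesWithoutError(spark, "PERMISSIVE")}")
    println(s"FAILFAST succeeds: ${parsesWithoutError(spark, "FAILFAST")}")
    spark.stop()
  }
}
```

The point of the sketch is only the contract: in PERMISSIVE mode a bad record should degrade to nulls, which is why an exception escaping to the user in that mode counts as a bug.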

Does this PR introduce any user-facing change?

Yes. Before the changes, the code below:

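    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    // "cards" is declared as an array of structs, but the input holds an array of numbers.
    val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
    val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
    val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
    pokerhand_events.show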

throws an exception even in the default PERMISSIVE mode:

java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)

After the changes:

+-----+
|event|
+-----+
| null|
+-----+

How was this patch tested?

Added a test to JsonFunctionsSuite.

MaxGekk commented Oct 13, 2020

@HyukjinKwon Could you review this PR?

MaxGekk commented Oct 13, 2020

The changes conflict with branch-3.0. Here is the backport to 3.0: #30032

@HyukjinKwon

We probably need to redesign/refactor the JSON parsing logic here. It's now pretty convoluted.

@HyukjinKwon

Merged to master.

@MaxGekk MaxGekk deleted the json-skip-row-wrong-schema branch December 11, 2020 20:28
MaxGekk pushed a commit that referenced this pull request Oct 17, 2022
…and JSON functions

### What changes were proposed in this pull request?

This PR is a follow-up for [SPARK-33134](https://issues.apache.org/jira/browse/SPARK-33134) (#30031).

I found another case where, depending on the order of columns, a failure to parse one JSON field breaks all of the subsequent fields, resulting in all nulls:

With a file like this:
```
{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}
```

Reading the file returns null for column `b` in the first row, even though `b` is valid in that record.
```scala
val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path")
```

Result:
```
a                   b
null                null
{"x":2,"y":null}    {"x":2}
```

However, column `b` should be:
```
{"x": 1}
{"x": 2}
```
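
To try this locally, the two sample records can be written to a temporary file and read back with the same schema. This is only a reproduction sketch: the object name, helper name, and temporary-file handling are assumptions, and what `show` prints depends on the Spark version in use.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object NestedFieldRepro {
  // Writes the two sample records to a temporary file and reads them back
  // with the schema from the description above.
  def readSample(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("spark-40646-", ".json")
    Files.write(path, java.util.Arrays.asList(
      """{"a": {"x": 1, "y": true}, "b": {"x": 1}}""",
      """{"a": {"x": 2}, "b": {"x": 2}}"""))
    spark.read
      .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
      .json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-40646-repro")
      .getOrCreate()
    // Column b is valid in both records, so a non-null b in every row
    // is the desired outcome.
    readSample(spark).show(false)
    spark.stop()
  }
}
```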

This particular example used to work in earlier Spark versions, but it was affected by SPARK-33134, which fixed another bug with incorrect parsing in `from_json`. Because this case was not tested, we missed it at the time.

In order to fix both SPARK-33134 and SPARK-40646, we need to process `PartialResultException` in the `convertArray` method to handle any errors in child objects. Without the fix, the code would not wrap the row in the array for `from_json`, resulting in a ClassCastException (SPARK-33134). Because of this handling, we no longer need the `isRoot` check in `convertObject`, thus unblocking SPARK-40646.
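
The idea can be illustrated with a toy model in plain Scala. This is a simplified sketch and not Spark's actual parser code: the `Json` and `Schema` types, `convert`, and `parsePermissive` are hypothetical, and only the names `convertObject`, `convertArray`, and `PartialResultException` mirror the ones mentioned above.

```scala
object PartialResultSketch {
  // Toy JSON values and schema types; all of these are hypothetical.
  sealed trait Json
  case class JNum(v: Long) extends Json
  case class JArr(items: List[Json]) extends Json
  case class JObj(fields: List[(String, Json)]) extends Json

  sealed trait Schema
  case object LongT extends Schema
  case class ArrayT(element: Schema) extends Schema
  case class StructT(fields: List[(String, Schema)]) extends Schema

  // Carries whatever could be parsed before a bad field was found.
  case class PartialResultException(partial: Any) extends Exception

  def convert(json: Json, schema: Schema): Any = (json, schema) match {
    case (JNum(v), LongT)            => v
    case (JArr(items), ArrayT(elem)) => convertArray(items, elem)
    case (JObj(fields), s: StructT)  => convertObject(fields, s)
    case _ => throw new IllegalArgumentException("type mismatch")
  }

  // A field that fails to parse becomes null (or its partial value);
  // the other fields are kept and the row is reported as partial.
  def convertObject(fields: List[(String, Json)], schema: StructT): List[Any] = {
    var bad = false
    val row = schema.fields.map { case (name, tpe) =>
      fields.find(_._1 == name) match {
        case None => null // a missing field is simply null
        case Some((_, value)) =>
          try convert(value, tpe) catch {
            case PartialResultException(p)   => bad = true; p
            case _: IllegalArgumentException => bad = true; null
          }
      }
    }
    if (bad) throw PartialResultException(row) else row
  }

  // Catching the partial result here keeps the row wrapped in its array.
  // A plain type mismatch in an element still propagates to the caller.
  def convertArray(items: List[Json], elem: Schema): List[Any] = {
    var bad = false
    val values = items.map { item =>
      try convert(item, elem) catch {
        case PartialResultException(p) => bad = true; p
      }
    }
    if (bad) throw PartialResultException(values) else values
  }

  // A permissive caller keeps the partial result instead of failing.
  def parsePermissive(json: Json, schema: Schema): Any =
    try convert(json, schema) catch {
      case PartialResultException(p)   => p
      case _: IllegalArgumentException => null
    }

  def main(args: Array[String]): Unit = {
    val st = StructT(List("c1" -> LongT, "c2" -> ArrayT(StructT(List("c3" -> LongT)))))
    val input = JArr(List(JObj(List("c2" -> JArr(List(JNum(19))), "c1" -> JNum(123456)))))
    println(parsePermissive(input, ArrayT(st))) // prints List(List(123456, null))
  }
}
```

In this model the bad `c2` field turns into null while `c1` survives, and because `convertArray` re-wraps the partial row, the caller still receives an array rather than a bare row.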

I updated the code to handle both cases. With these changes, we can correctly parse this case:
```scala
// `st` is the struct schema defined in the test; its definition is not shown here.
val df3 = Seq("""[{"c2": [19], "c1": 123456}]""").toDF("c0")
checkAnswer(df3.select(from_json($"c0", ArrayType(st))), Row(Array(Row(123456, null))))
```
which was previously returning `null` for the root row.

### Why are the changes needed?

Fixes a long-standing issue where a single incorrect field in a JSON record would break parsing of the entire record.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests for SPARK-40646 as well as SPARK-33134.

Closes #38090 from sadikovi/SPARK-40646.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…and JSON functions
