[SPARK-25243][SQL] Use FailureSafeParser in from_json #22237
Conversation
Test build #95266 has finished for PR 22237 at commit
jenkins, retest this, please
Test build #95267 has finished for PR 22237 at commit
@transient lazy val createParser = CreateJacksonParser.utf8String _
@transient lazy val parser = new FailureSafeParser[UTF8String](
  input => rawParser.parse(input, createParser, identity[UTF8String]),
  parsedOptions.parseMode,
I think we should keep using the previous default mode `FailFastMode`? Now the default mode becomes `PermissiveMode`.
The previous setting of `FailFastMode` didn't impact the behavior because the `mode` option wasn't handled at all.
It is not handled by `JacksonParser`, and the behavior here is somewhat similar to `PermissiveMode`, as @HyukjinKwon pointed out at https://github.com/apache/spark/pull/22237/files#r212850156, but not exactly the same. It seems `PermissiveMode` on `FailureSafeParser` now produces a different result on corrupted records. I noticed that some existing tests may have changed due to that.
checkAnswer(
  df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))),
  Row(null) :: Row(Row(2)) :: Nil)
How does it work for DROPMALFORMED mode? This doesn't actually drop the record like the JSON datasource does.
The `DROPMALFORMED` mode returns `null` for malformed JSON lines. Users can filter them out later. @HyukjinKwon Do you know how to drop rows in `UnaryExpression`s?
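For illustration, a minimal sketch of that workaround (assuming a DataFrame `df` with a string column `value`, a struct `schema`, and `spark.implicits._` in scope, as in the tests below):

import org.apache.spark.sql.functions.from_json

// from_json yields null for malformed lines in this mode, so users can
// drop those rows afterwards with a plain null filter.
val parsed = df.select(from_json($"value", schema).as("parsed"))
val withoutMalformed = parsed.filter($"parsed".isNotNull)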
Nope, the only possibility I raised was to make it a generator expression. I haven't proposed a parse mode for this reason so far.
Row(Row(null)) :: Row(Row(2)) :: Nil)
val exceptionOne = intercept[SparkException] {
  df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
`JsonToStructs` has resembled PERMISSIVE mode from the start, although their behaviours are slightly different. This is going to differ from both the PERMISSIVE and FAILFAST modes. These are actually behaviour changes if we just use PERMISSIVE mode here by default (as @viirya pointed out).
The behavior of `JsonToStructs` is actually pretty close to `PERMISSIVE`. I have to make just a few small changes in tests that check processing of malformed inputs.
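A sketch of the behaviour difference in question (inputs and schema are made up for illustration, and `spark.implicits._` is assumed):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 2}""", """{"a" 1}""").toDF("value")
df.select(from_json($"value", schema)).collect()
// Before this PR, a malformed line produced a null struct: Row(null).
// In PERMISSIVE mode it produces a struct with null fields: Row(Row(null)).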
Test build #95269 has finished for PR 22237 at commit
nullableSchema,
new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
@transient lazy val parsedOptions = new JSONOptions(options, timeZoneId.get)
@transient lazy val rawParser = new JacksonParser(nullableSchema, parsedOptions)
How about this?
@transient lazy val parser = {
  val parsedOptions = new JSONOptions(options, timeZoneId.get)
  val rawParser = new JacksonParser(nullableSchema, parsedOptions)
  val createParser = CreateJacksonParser.utf8String _
  new FailureSafeParser[UTF8String](
    input => rawParser.parse(input, createParser, identity[UTF8String]),
    parsedOptions.parseMode,
    schema,
    parsedOptions.columnNameOfCorruptRecord,
    parsedOptions.multiLine)
}
val actualSchema = StructType(struct.filterNot(_.name == columnNameOfCorruptRecord))
val resultRow = new GenericInternalRow(struct.length)
val nullResult = new GenericInternalRow(struct.length)
if (corruptFieldIndex.isDefined) {
Can we move `actualSchema` and `resultRow` inside `if (corruptFieldIndex.isDefined) {`?
I think one thing we could do for now is to support only the FAILFAST and PERMISSIVE modes and throw an exception otherwise, match the current behaviour to PERMISSIVE mode, and explain that in the migration guide.
@HyukjinKwon Should I target Spark 3.0 or 2.4?
If we can finish it before the code freeze, it will be 2.4; otherwise it is 3.0.
Test build #95378 has finished for PR 22237 at commit
Test build #95436 has finished for PR 22237 at commit
@transient lazy val parser = {
  val parsedOptions = new JSONOptions(options, timeZoneId.get)
  val mode = parsedOptions.parseMode
  require(mode == PermissiveMode || mode == FailFastMode,
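(The diff cuts the statement off here; judging by the error message quoted later in this thread, the full check presumably reads something like:)

require(mode == PermissiveMode || mode == FailFastMode,
  // Reject any mode other than PERMISSIVE and FAILFAST up front.
  s"from_json() doesn't support the ${mode.name} mode. " +
    s"Acceptable modes are ${PermissiveMode.name} and ${FailFastMode.name}.")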
I think we should move this verification into the constructor.
Also, can we use `AnalysisException` instead of `require`?
I didn't put `require` into the constructor body directly because of `timeZoneId`. If I move the check up, I need to move `val parsedOptions = new JSONOptions(options, timeZoneId.get)` up too (lazy or not). The check would force evaluation of `timeZoneId.get`, which would raise an exception. I will check this today or tomorrow.
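(A sketch of the underlying problem: `timeZoneId` on a timezone-aware expression is an `Option[String]` that is only filled in during analysis, so an eager check would crash on an unresolved expression.)

// Before the analyzer assigns a time zone, timeZoneId is None,
// so evaluating the options eagerly in the constructor would throw.
val timeZoneId: Option[String] = None
val parsedOptions = new JSONOptions(options, timeZoneId.get)  // NoSuchElementException: None.get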
ok, thanks!
import org.apache.spark.unsafe.types.UTF8String

class FailureSafeParser[IN](
    rawParser: IN => Seq[InternalRow],
    mode: ParseMode,
-   schema: StructType,
+   schema: DataType,
`schema` -> `dataType`?
Test build #95464 has finished for PR 22237 at commit
"Malformed records are detected in record parsing. Parse Mode: FAILFAST.")) | ||
|
||
val exception2 = intercept[AnalysisException] { | ||
df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))).collect() |
Can you fix the code to throw an analysis exception in the analysis phase instead of the execution phase (when `.collect()` is called)?
I replaced it with `AnalysisException`, but I think it is the wrong decision. Throwing `AnalysisException` at run time looks ugly:
Caused by: org.apache.spark.sql.AnalysisException: from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.;
at org.apache.spark.sql.catalyst.expressions.JsonToStructs.parser$lzycompute(jsonExpressions.scala:568)
at org.apache.spark.sql.catalyst.expressions.JsonToStructs.parser(jsonExpressions.scala:564)
...
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am going to replace it with something else or revert back to `IllegalArgumentException`.
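(One hypothetical alternative, not necessarily what the PR settled on: hook the mode validation into the expression's input check so a bad mode fails during analysis rather than at run time.)

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult

// Sketch: validate the mode when the analyzer checks the expression.
// A default time zone is passed only to parse the options; the mode
// itself does not depend on it, sidestepping the timeZoneId.get issue.
override def checkInputDataTypes(): TypeCheckResult = {
  val mode = new JSONOptions(options, "UTC").parseMode
  if (mode != PermissiveMode && mode != FailFastMode) {
    TypeCheckResult.TypeCheckFailure(
      s"from_json() doesn't support the ${mode.name} mode. " +
        s"Acceptable modes are ${PermissiveMode.name} and ${FailFastMode.name}.")
  } else {
    super.checkInputDataTypes()
  }
}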
docs/sql-programming-guide.md (outdated)
@@ -1897,6 +1897,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
- Since Spark 2.4, the from_json functions supports two modes - PERMISSIVE and FAILFAST. The modes can be set via the `mode` option. The default mode became PERMISSIVE. In previous versions, behavior of from_json did not conform to either PERMISSIVE nor FAILFAST, especially in processing of malformed JSON records.
nit: from_json -> `from_json`.
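For reference, the option described in this guide entry is passed the same way as in the tests above:

// PERMISSIVE is the default; FAILFAST must be requested explicitly.
df.select(from_json($"value", schema, Map("mode" -> "FAILFAST")))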
I agree with the current approach but wanna make sure whether we want this in 2.4.0 or 3.0.0, since there's no way to keep the previous behaviour and code freeze is super close. I actually prefer to go ahead in 3.0.0. @gatorsmile and @cloud-fan, WDYT? I think this will likely break existing user apps.
Test build #95532 has finished for PR 22237 at commit
Test build #95541 has finished for PR 22237 at commit
Test build #95760 has finished for PR 22237 at commit
@HyukjinKwon I re-targeted the changes for Spark 3.0. Please take a look at it one more time.
retest this please
Will take a look soon.
@HyukjinKwon Thank you. Waiting for your feedback.
Test build #95867 has finished for PR 22237 at commit
Test build #95896 has finished for PR 22237 at commit
@HyukjinKwon Please take a look at it again.
Force-pushed “…has more than 1 element for struct schema” from d91f34f to b2988c7.
https://github.com/apache/spark/pull/22237/files#r223707899 makes sense to me. Addressed. LGTM from my side as well.
LGTM, pending jenkins.
Test build #97958 has finished for PR 22237 at commit
retest this please
Test build #97966 has finished for PR 22237 at commit
thanks, merging to master!
Thanks all!!
@HyukjinKwon Thank you for the follow-up work on the PR. @cloud-fan @viirya @maropu Thanks for your reviews.
@@ -1694,7 +1694,7 @@ test_that("column functions", {
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema2 <- structType(structField("date", "date"))
s <- collect(select(df, from_json(df$col, schema2)))
- expect_equal(s[[1]][[1]], NA)
+ expect_equal(s[[1]][[1]]$date, NA)
What is the reason we made this change?
Do you mean this particular line or in general?
This line was changed because in `PERMISSIVE` mode we now return a `Row` with null fields for the values we weren't able to parse, instead of just `null` for the whole row.
In general, the change was made to support the `PERMISSIVE` and `FAILFAST` modes as in the JSON datasource. Before the changes, `from_json` didn't support any modes, and didn't support the `columnNameOfCorruptRecord` option in particular.
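(A sketch of what the newly wired-up option could enable — the column name and schema here are illustrative, and the exact `from_json` semantics should be checked against the tests:)

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// With a corrupt-record field in the schema, PERMISSIVE mode can surface
// the raw malformed input instead of silently nulling the whole struct.
val schemaWithCorrupt = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)
df.select(from_json($"value", schemaWithCorrupt,
  Map("columnNameOfCorruptRecord" -> "_corrupt_record")))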
Closes apache#22237 from MaxGekk/from_json-failuresafe. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
In the PR, I propose to switch `from_json` to `FailureSafeParser`, to make the function compatible with `PERMISSIVE` mode by default, and to support the `FAILFAST` mode as well. The `DROPMALFORMED` mode is not supported by `from_json`.
How was this patch tested?
It was tested by the existing `JsonSuite`/`CSVSuite`, `JsonFunctionsSuite` and `JsonExpressionsSuite`, as well as by new tests for `from_json` which check different modes.
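(For a flavour of those tests, a condensed sketch assembled from the snippets earlier in this thread — `checkAnswer` and `intercept` come from Spark's test harness, and `spark.implicits._` is assumed:)

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a" 1}""", """{"a": 2}""").toDF("value")

// PERMISSIVE (default): the malformed line becomes a struct with null fields.
checkAnswer(
  df.select(from_json($"value", schema)),
  Row(Row(null)) :: Row(Row(2)) :: Nil)

// FAILFAST: the malformed line raises an exception during execution.
val exception = intercept[SparkException] {
  df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
}
assert(exception.getMessage.contains(
  "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))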