
[SPARK-26303][SQL] Return partial results for bad JSON records #23253


Closed · wants to merge 15 commits into apache:master from MaxGekk:json-bad-record

Conversation

@MaxGekk (Member) commented Dec 7, 2018

## What changes were proposed in this pull request?

In the PR, I propose to return partial results from the JSON datasource and JSON functions in the PERMISSIVE mode if some of the JSON fields are parsed and converted to desired types successfully. The changes are made only for `StructType`. Bad JSON records as a whole are placed into the corrupt column specified by the `columnNameOfCorruptRecord` option or the corresponding SQL config.

Partial results are not returned for malformed JSON input.
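
To make the new behavior concrete, here is a minimal, self-contained sketch; it is not taken from the PR, and the schema, data, and app name are illustrative assumptions (`_corrupt_record` is the default value of the `columnNameOfCorruptRecord` option):

```scala
import org.apache.spark.sql.SparkSession

object PartialResultsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partial-results-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The first record has a valid "a" but a "b" that cannot be converted to INT.
    val jsons = Seq("""{"a": 1, "b": "str"}""", """{"a": 2, "b": 3}""").toDS()

    val df = spark.read
      .schema("a INT, b INT, _corrupt_record STRING")
      .option("mode", "PERMISSIVE")
      .json(jsons)

    // Before this change the first row would be (null, null, <whole record>);
    // with partial results it becomes (1, null, <whole record>).
    df.show(false)
  }
}
```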

## How was this patch tested?

Added a new unit test which checks converting JSON strings with one invalid field and one valid field at the end of the string.
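
For illustration, such a test can look roughly like the sketch below, written in the style of Spark's SQL test suites; the test name, schema, and data are assumptions, and `checkAnswer`/`testImplicits` are the usual QueryTest helpers:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

test("from_json returns a partial result for a bad record") {
  import testImplicits._
  val schema = new StructType()
    .add("a", DoubleType)
    .add("b", StringType)
  // "a" is invalid for DoubleType; "b" at the end of the string is valid.
  val df = Seq("""{"a": "-", "b": "valid"}""").toDF("json")
    .select(from_json($"json", schema).as("parsed"))
  // Before this PR the whole struct would be Row(null, null).
  checkAnswer(df, Row(Row(null, "valid")))
}
```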

@SparkQA commented Dec 7, 2018

Test build #99824 has finished for PR 23253 at commit e13673d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Dec 7, 2018

jenkins, retest this, please

@@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide

- In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.

- In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully.

Member

Hmm, if the returned row contains non-null fields, how do we know whether the row was read from a bad JSON record or a correct one?

Member

And this behavior is also documented in some places, like DataFrameReader.

Member Author

If the row was produced from a bad JSON record, the bad record is placed in the corrupt column; otherwise the corrupt column contains null.

Member

What if there is no corrupt column?

Member Author

In the PERMISSIVE mode there is no way, but at the moment (without this PR) you also cannot distinguish a row produced from a bad record from a row produced from a JSON object with all null fields.

A row with all nulls cannot by itself be an indicator of a bad record; an additional flag is needed, and a null or non-null value in the corrupt column plays that role.
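
For illustration (reusing the hypothetical `df` and the default `_corrupt_record` column name from the sketch in the description above):

```scala
// Rows whose corrupt column is non-null came from bad JSON records,
// even when some of their other fields were parsed successfully.
val badRows  = df.filter($"_corrupt_record".isNotNull)
val goodRows = df.filter($"_corrupt_record".isNull)
```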

Member

For such a behavior change, shall we add a config to roll back to the previous behavior?

Member

And you should also update the other places that define the previous behavior, like DataFrameReader.

Member Author

@cloud-fan This PR proposes changes similar to those in #23120. Could you take a look at it?

> For such a behavior change, shall we add a config to roll back to the previous behavior?

I don't think it makes sense to introduce a global SQL config for this particular case. The risk of breaking user apps is low because app logic cannot be based only on the presence of all nulls in a row; all nulls don't differentiate bad from valid JSON records. From my point of view, a note in the migration guide is enough.

Member

Ok. Sounds reasonable.

@SparkQA commented Dec 7, 2018

Test build #99828 has finished for PR 23253 at commit e13673d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 8, 2018

Test build #99874 has finished for PR 23253 at commit 62a6795.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 8, 2018

Test build #99875 has finished for PR 23253 at commit b19b3e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 8, 2018

Test build #99879 has finished for PR 23253 at commit fc088df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

cc @HyukjinKwon

- In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.
- In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, the returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.

- In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully.
Contributor

Does `from_csv` support it?

Member Author

`from_csv` was added recently; it didn't exist in 2.4.

if (badRecordException.isEmpty) {
  row
} else {
  throw BadRecordException(
    () => UTF8String.EMPTY_UTF8, () => Some(row), badRecordException.get)
}
Contributor

Add a comment to say that we don't know the original record here and that it will be filled in later.

Contributor

Or we can create a new exception type and use it here, which only carries the row and the exception.

Member Author

I added a separate exception.
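
With that, the parsing path plausibly ends up looking like the sketch below (an assumption based on the diff above, not necessarily the exact merged code):

```scala
if (badRecordException.isEmpty) {
  row
} else {
  // The original record text is unknown at this point; the new exception
  // carries only the partial row and the cause, and the caller fills in
  // the corrupt record later.
  throw PartialResultException(row, badRecordException.get)
}
```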

@cloud-fan (Contributor)

LGTM except a code style comment

@SparkQA commented Dec 9, 2018

Test build #99887 has finished for PR 23253 at commit 07e0498.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param partialResult the partial result of parsing a bad record.
* @param cause the actual exception about why the parser cannot return full result.
*/
case class PartialResultException(
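
The snippet above is truncated; given the scaladoc and the review note below about carrying an `InternalRow`, the full definition plausibly reads as follows (a reconstruction, not a verbatim quote):

```scala
case class PartialResultException(
    partialResult: InternalRow,
    cause: Throwable)
  extends Exception(cause)
```
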
Member

Ur, is this intentional? It looks like javax.naming.PartialResultException to me. Not only the same name, but also the semantics.

@cloud-fan, is this okay? Or shall we use a more distinguishable name like SparkPartialResultException instead?

Contributor

We need to carry the InternalRow here. I'm fine with the current name, as we don't have a Spark prefix in BadRecordException either.

@dongjoon-hyun (Member) commented Dec 11, 2018

I got it~ Thanks.

Member

Wait... but let's just rename it if possible. The cost of renaming is zero, and there are some benefits to it.

Member

I mean, we don't have to standardise the name but let's use another name that doesn't conflict with Java's libraries.

@SparkQA commented Dec 9, 2018

Test build #99888 has finished for PR 23253 at commit 9ca9248.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartialResultException(

@HyukjinKwon (Member)

retest this please

@SparkQA commented Dec 11, 2018

Test build #99953 has finished for PR 23253 at commit 9ca9248.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartialResultException(

@HyukjinKwon (Member) left a comment

LGTM. Let me just get this in. It looks like the exception is not going to be exposed to users anyway.

@HyukjinKwon (Member)

Merged to master.

@asfgit asfgit closed this in 4e1d859 Dec 11, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23253 from MaxGekk/json-bad-record.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@MaxGekk deleted the json-bad-record branch August 17, 2019 13:35