[SPARK-26303][SQL] Return partial results for bad JSON records #23253
@@ -35,7 +35,9 @@ displayTitle: Spark SQL Upgrading Guide
   - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
 
-  - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.
+  - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, the returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.
+
+  - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully.
Review comments:

- hmm, if the returned row contains non-`null` fields, how do we know if the row was read from a bad JSON record or a correct JSON record?
- And this behavior is also defined in some places like `DataFrameReader`.
- If the row was produced from a bad JSON record, the bad record is placed in the corrupt column; otherwise the corrupt column contains `null`.
- If there is no corrupt column?
- A row with all `null`s cannot by itself be an indicator of a bad record; an additional flag is needed.
- For such a behavior change, shall we add a config to roll back to the previous behavior?
- And you should also update the other places that define the previous behavior, like `DataFrameReader`.
- @cloud-fan This PR proposes similar changes as in #23120. Could you take a look at it? I don't think it makes sense to introduce a global SQL config for this particular case. The risk of breaking user apps is low because app logic cannot be based only on the presence of all `null`s in a row; all `null`s do not differentiate bad from valid JSON records. From my point of view, a note in the migration guide is enough.
- Ok. Sounds reasonable.
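The PERMISSIVE-mode change described in the migration note can be sketched as follows. This is an illustrative example, not code from the PR: the schema, sample record, and `_corrupt_record` column name are assumptions (the corrupt column name is configurable via `spark.sql.columnNameOfCorruptRecord`), and it assumes a local Spark 3.0+ runtime.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object PartialResultDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // Include a corrupt-record column so bad inputs stay identifiable.
    val schema = new StructType()
      .add("a", IntegerType)
      .add("b", StringType)
      .add("_corrupt_record", StringType)

    // "b" has the wrong type, so the record is bad, but "a" is still parseable.
    val ds = Seq("""{"a": 1, "b": {"unexpected": "object"}}""").toDS()

    val df = spark.read.schema(schema).json(ds)
    df.show(false)
    // Spark 2.4 and earlier: both a and b come back null for the bad record.
    // Since Spark 3.0: a = 1 is kept as a partial result, and
    // _corrupt_record holds the raw input, marking the row as bad.
    spark.stop()
  }
}
```

The key point from the review thread above: the corrupt column, not the presence of nulls, is what distinguishes a bad record from a valid one.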
## Upgrading From Spark SQL 2.3 to 2.4 | ||
@@ -20,6 +20,16 @@ package org.apache.spark.sql.catalyst.util
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
 
+/**
+ * Exception thrown when the underlying parser returns a partial result of parsing.
+ * @param partialResult the partial result of parsing a bad record.
+ * @param cause the actual exception about why the parser cannot return full result.
+ */
+case class PartialResultException(
+    partialResult: InternalRow,
+    cause: Throwable)
+  extends Exception(cause)
+
 /**
  * Exception thrown when the underlying parser meet a bad record and can't parse it.
  * @param record a function to return the record that cause the parser to fail

Review comments on `PartialResultException`:

- Ur, is this intentional? It looks like … @cloud-fan, is this okay? Or shall we use a more distinguishable name like …?
- We need to carry the …
- I got it~ Thanks.
- Wait .. but let's just rename it if possible .. the cost of renaming is 0 but there are some benefits by that ..
- I mean, we don't have to standardise the name but let's use another name that doesn't conflict with Java's libraries.
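A minimal, self-contained sketch of the control flow this exception enables. Everything here is illustrative: `SimpleRow` stands in for Spark's `InternalRow`, and the "k, k, k" record format and parsing logic are invented for the example, not taken from the PR.

```scala
// Stand-in for Spark's InternalRow, for illustration only.
final case class SimpleRow(values: Seq[Any])

// Simplified analogue of the PartialResultException added in this PR:
// it carries whatever was parsed so far, plus the underlying failure.
final case class PartialResultException(partialResult: SimpleRow, cause: Throwable)
  extends Exception(cause)

object PermissiveParser {
  // Hypothetical parser: reads comma-separated ints. On a bad field it
  // throws, but packages the fields parsed so far as a partial result.
  def parse(record: String): SimpleRow = {
    val parsed = scala.collection.mutable.ArrayBuffer[Any]()
    record.split(",").foreach { field =>
      try parsed += field.trim.toInt
      catch {
        case e: NumberFormatException =>
          parsed += null
          throw PartialResultException(SimpleRow(parsed.toSeq), e)
      }
    }
    SimpleRow(parsed.toSeq)
  }

  // PERMISSIVE mode: on failure, keep the partial row and stash the raw
  // record (the analogue of the corrupt-record column).
  def parsePermissive(record: String): (SimpleRow, Option[String]) =
    try (parse(record), None)
    catch {
      case PartialResultException(partial, _) => (partial, Some(record))
    }
}
```

For example, `PermissiveParser.parsePermissive("1, oops")` yields the partial row `SimpleRow(Seq(1, null))` together with the raw record, whereas a fully valid input yields the complete row and `None`.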
Review comments:

- does `from_csv` support it?
- `from_csv` was added recently. It didn't exist in 2.4.