# [SPARK-26263][SQL] Validate partition values with user provided schema #23215
## Conversation
I think this new behavior makes more sense, but we need to add a migration guide.
Test build #99657 has finished for PR 23215 at commit
Test build #99671 has finished for PR 23215 at commit

Test build #99672 has finished for PR 23215 at commit

Test build #99681 has finished for PR 23215 at commit
Test build #99755 has finished for PR 23215 at commit

Test build #99754 has finished for PR 23215 at commit

Test build #99753 has finished for PR 23215 at commit
retest this please.
+1 from me too. It makes sense to me and having configuration sounds good.
```scala
val columnValue = columnValueLiteral.eval()
val castedValue = Cast(columnValueLiteral, dataType, Option(timeZone.getID)).eval()
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw new RuntimeException(s"Failed to cast partition value `$columnValue` to $dataType")
}
```
Can we also show `columnName` in this exception message? That would make it easier to know which partition column has the error.
Good suggestion. Thanks.
Sounds good to me too. As there is a config, it is good that we can still disable it.
Test build #99761 has finished for PR 23215 at commit
```scala
@@ -95,6 +95,31 @@ class FileIndexSuite extends SharedSQLContext {
    }
  }

  test("SPARK-26263: Throw exception when partition value can't be converted to specific type") {
```
`can't be casted to user-specified type`
```scala
.doc("When this option is set to true, partition column values will be validated with " +
  "provided schema. If the validation fails, a runtime exception is thrown. " +
  "When this option is set to false, the partition column value will be converted to null " +
  "if it can not be converted to corresponding provided schema.")
```
... can not be casted to ...
```scala
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw new RuntimeException(s"Failed to cast value `$columnValue` to `$dataType` " +
    s"for partition column `$columnName`")
}
Literal.create(castedValue, userSpecifiedDataTypes(columnName))
```
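The check above hinges on a Spark detail: `Cast(...).eval()` returns null when the conversion fails, and the new flag turns that silent null into an error. As a dependency-free sketch of the same idea (the function name and use of `toIntOption` are illustrative, not Spark's API):

```scala
// Hypothetical stand-in for the partition-value check (Scala 2.13+).
// `raw.toIntOption` plays the role of Cast(...).eval() returning null on failure.
def castPartitionValue(raw: String, columnName: String, validatePartitionColumns: Boolean): Option[Int] = {
  val casted = raw.toIntOption
  if (validatePartitionColumns && casted.isEmpty) {
    throw new RuntimeException(
      s"Failed to cast value `$raw` to `IntegerType` for partition column `$columnName`")
  }
  casted // None corresponds to the old silent-null behavior
}

// castPartitionValue("42", "p", validatePartitionColumns = true)   // Some(42)
// castPartitionValue("foo", "p", validatePartitionColumns = false) // None (legacy behavior)
// castPartitionValue("foo", "p", validatePartitionColumns = true)  // throws RuntimeException
```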
`Literal.create(castedValue, dataType)`
Test build #99770 has finished for PR 23215 at commit

retest this please.

Test build #99766 has finished for PR 23215 at commit

Test build #99769 has finished for PR 23215 at commit

Test build #99776 has finished for PR 23215 at commit
thanks, merging to master!

Thank you @cloud-fan @viirya @HyukjinKwon.
## What changes were proposed in this pull request?

Currently, if the user provides a data schema, partition column values are converted according to it. But if the conversion fails, e.g. converting string to int, the column value becomes null.

This PR proposes to throw an exception in such cases, instead of silently converting to null:

1. These null partition column values don't make sense to users in most cases. It is better to surface the conversion failure so that users can adjust the schema or their ETL jobs to fix it.
2. Conversion failures on non-partition data columns already throw exceptions. Partition columns should have the same behavior.

We can reproduce the case above as follows:

```
/tmp/testDir
├── p=bar
└── p=foo
```

If we run:

```
val schema = StructType(Seq(StructField("p", IntegerType, false)))
spark.read.schema(schema).csv("/tmp/testDir/").show()
```

We will get:

```
+----+
|   p|
+----+
|null|
|null|
+----+
```

## How was this patch tested?

Unit test

Closes apache#23215 from gengliangwang/SPARK-26263.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
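For completeness, a hedged sketch of how the new behavior interacts with the config added by this PR, `spark.sql.sources.validatePartitionColumns` (assuming the `/tmp/testDir` layout from the description exists and a local Spark session):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val schema = StructType(Seq(StructField("p", IntegerType, false)))

// With validation enabled (the new default), the read fails instead of yielding nulls:
//   java.lang.RuntimeException: Failed to cast value `foo` to IntegerType for partition column `p`
// spark.read.schema(schema).csv("/tmp/testDir/").show()

// Setting the config to false restores the legacy silent-null behavior:
spark.conf.set("spark.sql.sources.validatePartitionColumns", "false")
spark.read.schema(schema).csv("/tmp/testDir/").show() // rows with p = null
```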