
[SPARK-26263][SQL] Validate partition values with user provided schema #23215


Closed
wants to merge 8 commits into from

Conversation

gengliangwang
Member

@gengliangwang gengliangwang commented Dec 4, 2018

What changes were proposed in this pull request?

Currently, if the user provides a data schema, partition column values are converted according to it. But if the conversion fails, e.g. converting a string to an int, the column value becomes null.

This PR proposes to throw an exception in such cases instead of silently converting the value to null:

  1. These null partition column values don't make sense to users in most cases. It is better to surface the conversion failure so users can adjust the schema or their ETL jobs to fix it.
  2. Such conversion failures already throw exceptions for non-partition data columns. Partition columns should have the same behavior.

We can reproduce the case above as follows:

```
/tmp/testDir
├── p=bar
└── p=foo
```

If we run:

```scala
val schema = StructType(Seq(StructField("p", IntegerType, false)))
spark.read.schema(schema).csv("/tmp/testDir/").show()
```

We will get:

```
+----+
|   p|
+----+
|null|
|null|
+----+
```
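A plain-Scala sketch of why the output above is all null (the helpers `parsePartitionDir` and `castToInt` are hypothetical stand-ins, not Spark's actual implementation, which goes through `PartitioningUtils` and `Cast`): each directory name is split into a column name and a raw string value, and a failed cast collapses to None, the analogue of Spark's null.

```scala
// Hypothetical sketch of Hive-style partition discovery and casting.
def parsePartitionDir(dirName: String): (String, String) = {
  val idx = dirName.indexOf('=')          // "p=bar" -> ("p", "bar")
  (dirName.substring(0, idx), dirName.substring(idx + 1))
}

// Stand-in for Cast(...).eval(): a failed conversion yields None (Spark: null).
def castToInt(raw: String): Option[Int] =
  scala.util.Try(raw.toInt).toOption

val partitions = Seq("p=bar", "p=foo").map(parsePartitionDir)
// Both raw values ("bar", "foo") fail to cast to Int, which is why the
// column shows two nulls in the output above.
val values = partitions.map { case (_, raw) => castToInt(raw) }
```

With the behavior proposed in this PR, the failed casts would raise an exception at read time instead of yielding None/null.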

How was this patch tested?

Unit test

@gengliangwang
Member Author

@cloud-fan

@cloud-fan
Contributor

I think this new behavior makes more sense, but we need to add a migration guide.

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99657 has finished for PR 23215 at commit 7060e12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-26263][SQL] Throw exception when Partition column value can't be converted to user specified type [SPARK-26263][SQL] Validate partition values with user provided schema Dec 4, 2018
@SparkQA

SparkQA commented Dec 4, 2018

Test build #99671 has finished for PR 23215 at commit 4719765.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99672 has finished for PR 23215 at commit 272bb1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99681 has finished for PR 23215 at commit 4060c30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99755 has finished for PR 23215 at commit ce2db28.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99754 has finished for PR 23215 at commit b24134a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99753 has finished for PR 23215 at commit 8e1653b.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@HyukjinKwon
Member

HyukjinKwon commented Dec 6, 2018

+1 from me too. It makes sense to me, and having a configuration sounds good.

```scala
val columnValue = columnValueLiteral.eval()
val castedValue = Cast(columnValueLiteral, dataType, Option(timeZone.getID)).eval()
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw new RuntimeException(s"Failed to cast partition value `$columnValue` to $dataType")
```
Member

Can we also show columnName in this exception message? That would make it easier to know which partition column has the error.

Member Author

Good suggestion. Thanks.

@viirya
Member

viirya commented Dec 6, 2018

Sounds good to me too. Since there is a config, we can still disable it if needed.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99761 has finished for PR 23215 at commit ce2db28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
@@ -95,6 +95,31 @@ class FileIndexSuite extends SharedSQLContext {
  }
}

test("SPARK-26263: Throw exception when partition value can't be converted to specific type") {
```
Contributor

can't be casted to user-specified type

```scala
.doc("When this option is set to true, partition column values will be validated with " +
  "provided schema. If the validation fails, a runtime exception is thrown." +
  "When this option is set to false, the partition column value will be converted to null " +
  "if it can not be converted to corresponding provided schema.")
```
Contributor

... can not be casted to ...

```scala
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw new RuntimeException(s"Failed to cast value `$columnValue` to `$dataType` " +
    s"for partition column `$columnName`")
}
Literal.create(castedValue, userSpecifiedDataTypes(columnName))
```
Member

```scala
Literal.create(castedValue, dataType)
```
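The check under review detects a failed cast by comparing the value before and after casting: the raw value was present but the casted result is null, so the cast must have failed. A minimal plain-Scala sketch of that idea (`detectCastFailure` is a hypothetical helper; Spark's real code compares the results of `Literal.eval()` and `Cast(...).eval()`):

```scala
// Hypothetical sketch of the pre/post-cast comparison in plain Scala.
def detectCastFailure(columnName: String, raw: String): Option[Int] = {
  // Stand-in for Cast(...).eval(); a failed parse collapses to None.
  val casted = scala.util.Try(raw.toInt).toOption
  if (raw != null && casted.isEmpty) {
    // Including the column name, as suggested in the review above,
    // tells the user which partition column is at fault.
    throw new RuntimeException(
      s"Failed to cast value `$raw` to `IntegerType` for partition column `$columnName`")
  }
  casted
}
```

A null raw value is left alone (it never had anything to cast), which is why the real check also requires `columnValue != null` before treating a null result as a failure.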

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99770 has finished for PR 23215 at commit 6f4e652.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99766 has finished for PR 23215 at commit 25f7039.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99769 has finished for PR 23215 at commit e13147a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2018

Test build #99776 has finished for PR 23215 at commit 6f4e652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 5a140b7 Dec 7, 2018
@gengliangwang
Member Author

Thank you @cloud-fan @viirya @HyukjinKwon .

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23215 from gengliangwang/SPARK-26263.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>