[SPARK-27119][SQL] Do not infer schema when reading Hive serde table with native data source #24041
Conversation
Test build #103264 has finished for PR 24041 at commit
xuanyuanking left a comment
+1 for me. Thanks for the nice PR description, very clear.
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
- .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+ .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
https://github.com/apache/spark/pull/24041/files#diff-9a6b543db706f1a90f790783d6930a13R592 nit for the doc: INFER_AND_SAVE is no longer the default. Should we mention this in the migration guide?
+1 for the comments: 1) update the config description at lines 592-593, and 2) update sql-migration-guide-upgrade.md.
docs/sql-migration-guide-upgrade.md (outdated)
  - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from the java.time packages that based on ISO chronology (https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and earlier, those operations are performed by using the hybrid calendar (Julian + Gregorian, see https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:
-   - CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
+   - CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
just fix the indentation
@cloud-fan, I think this indentation was intentional, to keep these notes grouped under the Proleptic Gregorian calendar changes.
damn my editor didn't detect this.... let me revert it
Test build #103296 has finished for PR 24041 at commit
retest this please
Test build #103302 has finished for PR 24041 at commit
Yea, very nice PR description.
Test build #103321 has finished for PR 24041 at commit
gatorsmile left a comment
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
In Spark 2.1, we hit a correctness bug: when reading a Hive serde parquet table with the native parquet data source, if the actual file schema doesn't match the table schema in the Hive metastore (only upper/lower case differences), the query returns 0 results.
The reason is that the parquet reader is case-sensitive: if we push down filters with column names that don't match the file's physical schema case-sensitively, no data is returned.
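As an illustration (this is a sketch, not code from this PR; the table name, column name, and path below are made up), the mismatch scenario in Spark 2.1 looks roughly like this:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch: a Hive serde parquet table whose data files
// use the column name "ID" while the Hive metastore schema uses "id".
val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Write parquet files with an upper-case column name.
spark.range(10).selectExpr("id AS ID").write.parquet("/tmp/t_data")

// Register a Hive serde table over those files; the metastore schema is lower-case.
spark.sql("""
  CREATE EXTERNAL TABLE t (id BIGINT)
  STORED AS PARQUET
  LOCATION '/tmp/t_data'
""")

// In Spark 2.1, the filter on "id" was pushed down case-sensitively against the
// physical column "ID", so this query returned 0 rows instead of the expected 4.
spark.sql("SELECT * FROM t WHERE id > 5").show()
```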
To fix this bug, two solutions were proposed at the time:
Add a config to optionally disable parquet filter pushdown, and make parquet column pruning case insensitive.
[SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution #16797
Infer the actual schema from data files, when reading Hive serde table with native data source. A config is provided to disable it.
[SPARK-19611][SQL] Introduce configurable table schema inference #17229
Solution 2 was accepted and merged into Spark 2.1.1.
In Spark 2.4, we refactored the parquet data source a little:
do parquet filter pushdown with the actual file schema.
[SPARK-24716][SQL] Refactor ParquetFilters #21696
make parquet filter pushdown case insensitive.
[SPARK-25207][SQL] Case-insensitve field resolution for filter pushdown when reading Parquet #22197
make parquet column pruning case insensitive.
[SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet #22148
With these patches, the correctness bug in Spark 2.1 no longer exists, and the schema inference becomes unnecessary.
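Reusing the hypothetical table from the sketch above, the Spark 2.4+ behavior can be checked along these lines (again just an illustrative sanity check, not a test from this PR):

```scala
// With the SPARK-24716 / SPARK-25207 / SPARK-25132 changes, the pushed-down filter
// and the pruned columns are resolved against the physical column "ID"
// case-insensitively, so the query returns the expected 4 rows even when schema
// inference is disabled.
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
spark.sql("SELECT * FROM t WHERE id > 5").show()
```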
To be safe, this PR only changes the default value to NEVER_INFER, so that users can still set it back to INFER_AND_SAVE if needed. If we don't receive any bug reports for it, we can remove the related code in the next release.
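For users who still rely on schema inference, one way to restore the old behavior is to set the config explicitly (valid values for this config are INFER_AND_SAVE, INFER_ONLY, and NEVER_INFER):

```scala
// Restore the previous default if schema inference is still wanted.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
```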
How was this patch tested?
Existing tests.