[WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808
Conversation
Is this meant to be a PR? Looking at the code, it seems more like a work in progress.
@jodersky as part of 2.0, we are trying to extend the vectorized parquet record reader to work with all supported data types. This PR is created to scope out the change by identifying the datatypes that would currently fall back to the default parquet-mr record reader (by running the entire test suite on it).
Sure, but it seems somewhat premature to open a PR. Maybe you could add a [WIP] into the title?
Test build #53491 has finished for PR 11808 at commit
Yes - let's add WIP to the title if a PR is not ready for review.
@sameeragarwal At least, TimestampType is not supported. |
Thanks, from the test failures, it seems like among the hive-supported primitive datatypes, `TimestampType` and high-precision `DecimalType`s are not yet supported by the vectorized reader.
…uet reader

## What changes were proposed in this pull request?

This patch adds support for reading `DecimalType`s with high (> 18) precision in `VectorizedColumnReader`.

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (which tests all supported hive types, including `DecimalType(25, 5)`) fails when the gating condition is removed (#11808) and should now pass with this change.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11869 from sameeragarwal/bigdecimal-parquet.
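The decimal gating condition above can be sketched as follows. This is a hypothetical, self-contained illustration, not Spark's actual API: the class name, method name, and the `MAX_LONG_DIGITS` constant are stand-ins for the real check inside `VectorizedColumnReader`.

```java
// Hypothetical sketch of the pre-patch fallback decision for decimal columns.
// Names are illustrative; the real logic lives in Spark's VectorizedColumnReader.
public class DecimalFallbackSketch {
    // Decimals with up to 18 digits of precision fit in a 64-bit long,
    // which the vectorized reader could already handle.
    static final int MAX_LONG_DIGITS = 18;

    // Pre-patch behavior: precision above MAX_LONG_DIGITS forced a fall back
    // to the row-by-row parquet-mr reader.
    static boolean fallsBackToParquetMr(int precision) {
        return precision > MAX_LONG_DIGITS;
    }

    public static void main(String[] args) {
        System.out.println(fallsBackToParquetMr(18)); // prints false: fits in a long
        System.out.println(fallsBackToParquetMr(25)); // prints true: e.g. DecimalType(25, 5)
    }
}
```

Removing this condition (as #11869 does) means high-precision decimals stay on the vectorized path instead of hitting the slower fallback.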
## What changes were proposed in this pull request?

This PR adds support for `TimestampType` in the vectorized parquet reader.

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (which tests all supported hive types, including `TimestampType`) fails when the gating condition is removed (#11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed should now pass as well.
3. Added tests in `HadoopFsRelationTest` that exercise both the dictionary-encoded and non-encoded versions across all supported datatypes.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11882 from sameeragarwal/timestamp-parquet.
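The timestamp gating condition quoted above can be sketched the same way. This is a hypothetical illustration: the enum below mirrors only a subset of Parquet's physical type names, and the class and method names are stand-ins, not Spark's actual code.

```java
// Hypothetical sketch of the pre-patch fallback decision for timestamp columns.
// The enum lists a subset of Parquet physical types for illustration.
public class TimestampFallbackSketch {
    enum PrimitiveTypeName { INT32, INT64, INT96, BINARY }

    // Pre-patch behavior: INT96 columns (the physical type Hive and Impala
    // use to store timestamps) forced a fall back to the parquet-mr reader.
    static boolean fallsBackToParquetMr(PrimitiveTypeName t) {
        return t == PrimitiveTypeName.INT96;
    }

    public static void main(String[] args) {
        System.out.println(fallsBackToParquetMr(PrimitiveTypeName.INT64)); // prints false
        System.out.println(fallsBackToParquetMr(PrimitiveTypeName.INT96)); // prints true
    }
}
```

With the condition removed (#11882), INT96 timestamp columns are decoded on the vectorized path as well.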
Investigating (via tests) all the data types that are not supported by the vectorized parquet record reader.