
[WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808


Closed
wants to merge 1 commit

Conversation

sameeragarwal
Member

Investigating (via tests) all the data types that are not supported by vectorized parquet record reader.

@sameeragarwal sameeragarwal changed the title [SPARK-13994][SQL] Investigate types not supported by vectorized parquet record reader [SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader Mar 18, 2016
@jodersky
Member

Is this meant to be a PR? Looking at the code, it seems more like a work in progress.

@sameeragarwal
Member Author

@jodersky as part of 2.0, we are trying to extend the vectorized parquet record reader to work with all supported data types. This PR is created to scope out the change by identifying those datatypes that'd currently fall back to the default parquet-mr record reader (by running the entire test suite on it).
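For reference, a minimal Scala sketch (not this PR's actual test code) of how such a probe can be run by hand: write a Parquet file containing a candidate type, force the vectorized reader on, and read it back. The `SparkSession` API and the config key `spark.sql.parquet.enableVectorizedReader` are assumptions from later 2.x builds and may differ at this point in the codebase; the object name and scratch path are hypothetical.

```scala
// Probe whether a given type survives a round trip through the vectorized
// Parquet reader. Assumes a Spark 2.x SparkSession and the (assumed) flag
// spark.sql.parquet.enableVectorizedReader.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object VectorizedReaderProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("vectorized-parquet-probe")
      .config("spark.sql.parquet.enableVectorizedReader", "true") // force the vectorized path
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/probe-timestamp-parquet" // hypothetical scratch location

    // Candidate type under test: TimestampType (swap in DecimalType(25, 5), etc.)
    Seq(Timestamp.valueOf("2016-03-18 00:00:00")).toDF("ts")
      .write.mode("overwrite").parquet(path)

    // If the type is unsupported by the vectorized reader, this read either fails
    // or falls back to parquet-mr, depending on the reader's gating logic.
    spark.read.parquet(path).show()
    spark.stop()
  }
}
```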

@sameeragarwal sameeragarwal changed the title [SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader [WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader Mar 18, 2016
@jodersky
Copy link
Member

Sure, but it seems somewhat premature to open a PR. Maybe you could add [WIP] to the title?

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53491 has finished for PR 11808 at commit e56327d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 18, 2016

Yes - let's add WIP to the title if a PR is not ready for review.

@davies
Contributor

davies commented Mar 18, 2016

@sameeragarwal At least, TimestampType is not supported.

@sameeragarwal
Member Author

Thanks. From the test failures, it looks like among the Hive-supported primitive data types, DecimalType(25,5) and TimestampType are currently not supported. Created follow-up JIRAs https://issues.apache.org/jira/browse/SPARK-14015 and https://issues.apache.org/jira/browse/SPARK-14016 to track the tasks. Closing this PR.

asfgit pushed a commit that referenced this pull request Mar 22, 2016
…uet reader

## What changes were proposed in this pull request?

This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (#11808) and should now pass with this change.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11869 from sameeragarwal/bigdecimal-parquet.
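For illustration, a schematic Scala sketch of the fallback rule described in the commit message above. This is not the actual (Java) code in `VectorizedColumnReader`; the object name and helper method are hypothetical, but the condition mirrors the quoted `precision > Decimal.MAX_LONG_DIGITS()` check.

```scala
// Pre-patch rule: decimals wider than Decimal.MAX_LONG_DIGITS (18 digits) could
// not be read on the vectorized path and forced a fall back to parquet-mr.
import org.apache.spark.sql.types.{DataType, Decimal, DecimalType}

object DecimalGate {
  /** True if the (pre-patch) vectorized reader had to give up on this type. */
  def mustFallBackToParquetMr(dt: DataType): Boolean = dt match {
    // Precision up to 18 digits fits in a Long and was already supported;
    // anything wider, e.g. DecimalType(25, 5), tripped the gating condition.
    case d: DecimalType => d.precision > Decimal.MAX_LONG_DIGITS
    case _              => false
  }

  def main(args: Array[String]): Unit = {
    println(mustFallBackToParquetMr(DecimalType(18, 2))) // false: fits in a Long
    println(mustFallBackToParquetMr(DecimalType(25, 5))) // true: pre-patch fallback
  }
}
```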
asfgit pushed a commit that referenced this pull request Mar 23, 2016
## What changes were proposed in this pull request?

This PR adds support for TimestampType in the vectorized parquet reader

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (#11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
3. Added tests in `HadoopFsRelationTest` that test both the dictionary-encoded and non-encoded versions across all supported data types.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11882 from sameeragarwal/timestamp-parquet.
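For illustration, a schematic Scala sketch of the timestamp gating described in the commit message above. Again, this is not the actual (Java) code in `VectorizedColumnReader`; the object name and helper are hypothetical, but the condition mirrors the quoted INT96 check: Parquet stores these timestamps as INT96, and any INT96 column previously forced a fall back to parquet-mr.

```scala
// Pre-patch rule: an INT96 column (the physical type used for TimestampType)
// could not be read on the vectorized path and forced a fall back to parquet-mr.
import org.apache.parquet.schema.PrimitiveType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.Types

object TimestampGate {
  /** True if the (pre-patch) vectorized reader had to give up on this column. */
  def mustFallBackToParquetMr(primitiveType: PrimitiveType): Boolean =
    primitiveType.getPrimitiveTypeName == PrimitiveTypeName.INT96

  def main(args: Array[String]): Unit = {
    // INT96 "ts" column, as produced for TimestampType: pre-patch fallback.
    val ts = Types.required(PrimitiveTypeName.INT96).named("ts")
    // INT64 column: handled by the vectorized reader either way.
    val id = Types.required(PrimitiveTypeName.INT64).named("id")
    println(mustFallBackToParquetMr(ts)) // true
    println(mustFallBackToParquetMr(id)) // false
  }
}
```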