
[WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808


Closed
wants to merge 1 commit

Conversation

sameeragarwal
Member

Investigating (via tests) all the data types that are not supported by vectorized parquet record reader.

@sameeragarwal sameeragarwal changed the title [SPARK-13994][SQL] Investigate types not supported by vectorized parquet record reader [SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader Mar 18, 2016
@jodersky
Member

Is this meant to be a PR? Looking at the code, it seems more like a work in progress.

@sameeragarwal
Member Author

@jodersky as part of 2.0, we are trying to extend the vectorized parquet record reader to work with all supported data types. This PR is created to scope out the change by identifying those datatypes that'd currently fall back to the default parquet-mr record reader (by running the entire test suite on it).
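For reference, a minimal Scala sketch (not this PR's actual test code) of how such a probe can be run by hand: write a Parquet file containing a candidate type, force the vectorized reader on, and read it back. The `SparkSession` API and the config key `spark.sql.parquet.enableVectorizedReader` are assumptions from later 2.x builds and may differ at this point in the codebase; the object name and scratch path are hypothetical.

```scala
// Probe whether a given type survives a round trip through the vectorized
// Parquet reader. Assumes a Spark 2.x SparkSession and the (assumed) flag
// spark.sql.parquet.enableVectorizedReader.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object VectorizedReaderProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("vectorized-parquet-probe")
      .config("spark.sql.parquet.enableVectorizedReader", "true") // force the vectorized path
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/probe-timestamp-parquet" // hypothetical scratch location

    // Candidate type under test: TimestampType (swap in DecimalType(25, 5), etc.)
    Seq(Timestamp.valueOf("2016-03-18 00:00:00")).toDF("ts")
      .write.mode("overwrite").parquet(path)

    // If the type is unsupported by the vectorized reader, this read either fails
    // or falls back to parquet-mr, depending on the reader's gating logic.
    spark.read.parquet(path).show()
    spark.stop()
  }
}
```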

@sameeragarwal sameeragarwal changed the title [SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader [WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader Mar 18, 2016
@jodersky
Copy link
Member

Sure, but it seems somewhat premature to open a PR. Maybe you could add [WIP] to the title?

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53491 has finished for PR 11808 at commit e56327d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 18, 2016

Yes - let's add WIP to the title if a PR is not ready for review.

@davies
Contributor

davies commented Mar 18, 2016

@sameeragarwal At least, TimestampType is not supported.

@sameeragarwal
Member Author

Thanks. From the test failures, it looks like among the Hive-supported primitive data types, DecimalType(25,5) and TimestampType are currently not supported. Created follow-up JIRAs https://issues.apache.org/jira/browse/SPARK-14015 and https://issues.apache.org/jira/browse/SPARK-14016 to track the tasks. Closing this PR.

asfgit pushed a commit that referenced this pull request Mar 22, 2016
…uet reader

## What changes were proposed in this pull request?

This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (#11808) and should now pass with this change.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11869 from sameeragarwal/bigdecimal-parquet.
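For illustration, a schematic Scala sketch of the fallback rule described in the commit message above. This is not the actual (Java) code in `VectorizedColumnReader`; the object name and helper method are hypothetical, but the condition mirrors the quoted `precision > Decimal.MAX_LONG_DIGITS()` check.

```scala
// Pre-patch rule: decimals wider than Decimal.MAX_LONG_DIGITS (18 digits) could
// not be read on the vectorized path and forced a fall back to parquet-mr.
import org.apache.spark.sql.types.{DataType, Decimal, DecimalType}

object DecimalGate {
  /** True if the (pre-patch) vectorized reader had to give up on this type. */
  def mustFallBackToParquetMr(dt: DataType): Boolean = dt match {
    // Precision up to 18 digits fits in a Long and was already supported;
    // anything wider, e.g. DecimalType(25, 5), tripped the gating condition.
    case d: DecimalType => d.precision > Decimal.MAX_LONG_DIGITS
    case _              => false
  }

  def main(args: Array[String]): Unit = {
    println(mustFallBackToParquetMr(DecimalType(18, 2))) // false: fits in a Long
    println(mustFallBackToParquetMr(DecimalType(25, 5))) // true: pre-patch fallback
  }
}
```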
asfgit pushed a commit that referenced this pull request Mar 23, 2016
## What changes were proposed in this pull request?

This PR adds support for TimestampType in the vectorized parquet reader

## How was this patch tested?

1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (#11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
3. Added tests in `HadoopFsRelationTest` that test both the dictionary-encoded and non-encoded versions across all supported data types.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11882 from sameeragarwal/timestamp-parquet.
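For illustration, a schematic Scala sketch of the timestamp gating described in the commit message above. Again, this is not the actual (Java) code in `VectorizedColumnReader`; the object name and helper are hypothetical, but the condition mirrors the quoted INT96 check: Parquet stores these timestamps as INT96, and any INT96 column previously forced a fall back to parquet-mr.

```scala
// Pre-patch rule: an INT96 column (the physical type used for TimestampType)
// could not be read on the vectorized path and forced a fall back to parquet-mr.
import org.apache.parquet.schema.PrimitiveType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.Types

object TimestampGate {
  /** True if the (pre-patch) vectorized reader had to give up on this column. */
  def mustFallBackToParquetMr(primitiveType: PrimitiveType): Boolean =
    primitiveType.getPrimitiveTypeName == PrimitiveTypeName.INT96

  def main(args: Array[String]): Unit = {
    // INT96 "ts" column, as produced for TimestampType: pre-patch fallback.
    val ts = Types.required(PrimitiveTypeName.INT96).named("ts")
    // INT64 column: handled by the vectorized reader either way.
    val id = Types.required(PrimitiveTypeName.INT64).named("id")
    println(mustFallBackToParquetMr(ts)) // true
    println(mustFallBackToParquetMr(id)) // false
  }
}
```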