
[SPARK-14015][SQL] Support TimestampType in vectorized parquet reader #11882


Closed
wants to merge 4 commits

Conversation

sameeragarwal
Member

What changes were proposed in this pull request?

This PR adds support for TimestampType in the vectorized parquet reader (a sketch of the underlying INT96 decoding appears after the list below).

How was this patch tested?

  1. VectorizedColumnReader initially had a gating condition on primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96 that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
  2. The ParquetHadoopFsRelationSuite (which tests all supported Hive types, including TimestampType) fails when the gating condition is removed ([WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808) and should now pass with this change. Similarly, the ParquetHiveCompatibilitySuite SPARK-10177 timestamp test, which fails when the gating condition is removed, should now pass as well.
  3. Added tests in HadoopFsRelationTest that exercise both the dictionary-encoded and non-encoded versions across all supported data types.
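
For context, a Parquet INT96 timestamp packs an 8-byte nanoseconds-of-day value followed by a 4-byte Julian day number, little-endian, while Spark's TimestampType stores microseconds since the Unix epoch. Below is a minimal, hedged sketch of that decoding; the class and constant names are illustrative rather than the exact ones in the Spark codebase.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Decoder {
  // Julian day number corresponding to 1970-01-01 (the Unix epoch).
  private static final long JULIAN_DAY_OF_EPOCH = 2440588L;
  private static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000L * 1000L;

  // Decodes a 12-byte INT96 value (8 bytes nanos-of-day, then 4 bytes
  // Julian day) into microseconds since the epoch.
  public static long toMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    int julianDay = buf.getInt();
    return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000L;
  }
}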

@sameeragarwal
Member Author

cc @nongli

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53757 has finished for PR 11882 at commit a612f8b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case INT96:
  if (column.dataType() == DataTypes.TimestampType) {
    for (int i = rowId; i < rowId + num; ++i) {
      Binary v = dictionary.decodeToBinary(dictionaryIds.getInt(i));
Contributor

hm -- maybe we can do something a lot cheaper? At the very least maybe we can remove the creation of this Binary object, since we are turning it immediately into a Long.

Member Author

sure, that sounds good

Member Author

Added a TODO for converting the dictionary of binaries into longs to make this cheaper.
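
For the record, a hedged sketch of what that TODO describes: decode each dictionary entry to a long once up front, so the per-row path becomes a plain array lookup instead of allocating a Binary per value. Dictionary here is parquet-mr's org.apache.parquet.column.Dictionary, and Int96Decoder is the hypothetical helper sketched above.

import org.apache.parquet.column.Dictionary;

// Decode the entire INT96 dictionary to microsecond longs once; the
// dictionary is shared across rows in a column chunk, so this cost is
// paid once rather than per value.
static long[] decodeDictionaryToMicros(Dictionary dictionary) {
  long[] decoded = new long[dictionary.getMaxId() + 1];
  for (int id = 0; id <= dictionary.getMaxId(); ++id) {
    decoded[id] = Int96Decoder.toMicros(dictionary.decodeToBinary(id).getBytes());
  }
  return decoded;
}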

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53759 has finished for PR 11882 at commit 3656dae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -308,7 +322,7 @@ private void readIntBatch(int rowId, int num, ColumnVector column) throws IOException {

 private void readLongBatch(int rowId, int num, ColumnVector column) throws IOException {
   // This is where we implement support for the valid type conversions.
-  if (column.dataType() == DataTypes.LongType ||
+  if (column.dataType() == DataTypes.LongType || column.dataType() == DataTypes.TimestampType ||
       DecimalType.is64BitDecimalType(column.dataType())) {
     defColumn.readLongs(
Contributor

hmm. How does this work? readLongs is expecting to read parquet int64 physical types. How is it able to read this other physical type?

Contributor

Do we have tests where this type is not dictionary encoded?

Member Author

Thanks, fixed. Also added tests in HadoopFsRelationTest that test both the dictionary encoded and non-encoded versions across all supported datatypes.
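
A hedged sketch of how a test can force the non-dictionary path (assuming the Spark 1.6-era Java API; sqlContext, df, and dir are placeholders, not the exact harness in HadoopFsRelationTest): parquet-mr honors ParquetOutputFormat.ENABLE_DICTIONARY ("parquet.enable.dictionary"), so writing the same data with it toggled exercises both decoding paths on read.

import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Write the same data twice, once per encoding, so a subsequent read
// covers both the dictionary-encoded and plain decoding paths.
static void writeBothEncodings(SQLContext sqlContext, DataFrame df, String dir) {
  sqlContext.sparkContext().hadoopConfiguration()
      .set(ParquetOutputFormat.ENABLE_DICTIONARY, "true");
  df.write().parquet(dir + "/dict");

  sqlContext.sparkContext().hadoopConfiguration()
      .set(ParquetOutputFormat.ENABLE_DICTIONARY, "false");
  df.write().parquet(dir + "/plain");
}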

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53829 has finished for PR 11882 at commit 1f511ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -259,8 +259,6 @@ private void initializeInternal() throws IOException {
   if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
     throw new IOException("Complex types not supported.");
   }
-  PrimitiveType primitiveType = t.asPrimitiveType();
-
   originalTypes[i] = t.getOriginalType();

   // TODO: Be extremely cautious in what is supported. Expand this.
Contributor

I think we can remove this check now too.

Member Author

sure, sounds good.

@nongli
Contributor

nongli commented Mar 23, 2016

LGTM

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53940 has finished for PR 11882 at commit 00b1f12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 0a64294 Mar 23, 2016