Infer parquet reader type based on file metadata by saurabhd336 · Pull Request #9294 · apache/pinot

saurabhd336 · 2022-08-29T17:54:50Z

This PR lets the Parquet reader choose between ParquetAvroRecordReader and ParquetNativeRecordReader automatically, based on the Parquet file's metadata.
The reader config can still be used to force a specific reader, but the default behaviour is to infer the reader type from the file schema.

Reader config flags
setUseParquetNativeRecordReader(true) -> Use ParquetNativeRecordReader
setUseParquetAvroRecordReader(true) -> Use ParquetAvroRecordReader
default -> Infer reader type based on file metadata
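
The flag handling above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ReaderFlags`, `ReaderSelectionSketch`, and `chooseReader` are hypothetical names, the order in which the two flags are checked is an assumption, and so is the rule that a file carrying an Avro schema maps to the Avro reader.

```java
/** Hypothetical stand-in for the reader config; it models only the two flags listed above. */
class ReaderFlags {
  boolean useParquetAvroRecordReader;
  boolean useParquetNativeRecordReader;
}

class ReaderSelectionSketch {
  enum ReaderType { AVRO, NATIVE }

  /**
   * Picks a reader type. An explicit flag wins; otherwise the choice is inferred
   * from whether the file metadata carries an Avro schema.
   * Assumption: the Avro flag is checked first if both flags are set.
   */
  static ReaderType chooseReader(ReaderFlags flags, boolean fileHasAvroSchema) {
    if (flags != null && flags.useParquetAvroRecordReader) {
      return ReaderType.AVRO;
    }
    if (flags != null && flags.useParquetNativeRecordReader) {
      return ReaderType.NATIVE;
    }
    // Default: no flag set (or no config at all), so infer from the file.
    return fileHasAvroSchema ? ReaderType.AVRO : ReaderType.NATIVE;
  }
}
```

With no config at all, `chooseReader(null, true)` falls through to the inference branch, so a null config behaves the same as a config with neither flag set.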

xiangfu0 · 2022-08-30T06:40:04Z

Can you add a sample data file with a decimal field and a test to ensure the file is correctly parsed?

KKcorps · 2022-08-30T06:59:07Z

...t-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetRecordReader.java

This might throw a NullPointerException if fileKeyValueMeta is null.

xiangfu0 · 2022-08-30T08:19:50Z

...t-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetRecordReader.java

You can put this check hasAvroSchemaInParquetFile() inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils and reuse the same method inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils.getParquetAvroSchema(...)
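
A null-safe version of such a shared check might look like the sketch below. It is an illustration only: the class and method names are hypothetical, and the two metadata keys are assumed to be the ones parquet-avro uses to store the writer's Avro schema. The map parameter stands in for the file's key-value metadata (the `fileKeyValueMeta` mentioned in the earlier review comment).

```java
import java.util.Map;

class AvroSchemaCheckSketch {
  // Assumed keys under which parquet-avro records the Avro schema in the file footer.
  static final String AVRO_SCHEMA_KEY = "parquet.avro.schema";
  static final String OLD_AVRO_SCHEMA_KEY = "avro.schema";

  /** Returns true if the key-value metadata carries an Avro schema; tolerates a null map. */
  static boolean hasAvroSchema(Map<String, String> fileKeyValueMeta) {
    if (fileKeyValueMeta == null) {
      return false;
    }
    return fileKeyValueMeta.containsKey(AVRO_SCHEMA_KEY)
        || fileKeyValueMeta.containsKey(OLD_AVRO_SCHEMA_KEY);
  }

  /** Reuses the same check when looking up the schema string; returns null if none is present. */
  static String avroSchemaJson(Map<String, String> fileKeyValueMeta) {
    if (!hasAvroSchema(fileKeyValueMeta)) {
      return null;
    }
    String schema = fileKeyValueMeta.get(AVRO_SCHEMA_KEY);
    return schema != null ? schema : fileKeyValueMeta.get(OLD_AVRO_SCHEMA_KEY);
  }
}
```

Keeping the null check inside the helper means both the reader selection and the schema lookup get the same guard without repeating it.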

codecov-commenter · 2022-08-30T09:28:29Z

Codecov Report

Merging #9294 (ed4d8e4) into master (d1a71d8) will decrease coverage by 2.78%.
The diff coverage is 46.56%.

❗ Current head ed4d8e4 differs from pull request most recent head 711add8. Consider uploading reports for the commit 711add8 to get more accurate results

@@             Coverage Diff              @@
##             master    #9294      +/-   ##
============================================
- Coverage     69.82%   67.04%   -2.79%     
- Complexity     4696     4824     +128     
============================================
  Files          1873     1391     -482     
  Lines         99623    72184   -27439     
  Branches      15146    11583    -3563     
============================================
- Hits          69564    48396   -21168     
+ Misses        25118    20266    -4852     
+ Partials       4941     3522    -1419

Flag	Coverage Δ
integration1	`?`
integration2	`?`
unittests1	`67.04% <46.56%> (-0.08%)`	⬇️
unittests2	`?`

Impacted Files	Coverage Δ
...aggregation/function/AggregationFunctionUtils.java	`80.00% <ø> (+0.51%)`	⬆️
...not/query/planner/logical/RelToStageConverter.java	`73.91% <ø> (ø)`
.../columnminmaxvalue/ColumnMinMaxValueGenerator.java	`73.68% <0.00%> (ø)`
...a/org/apache/pinot/segment/spi/ColumnMetadata.java	`80.00% <0.00%> (-20.00%)`	⬇️
...java/org/apache/pinot/segment/spi/V1Constants.java	`12.50% <ø> (ø)`
...ator/transform/function/BaseTransformFunction.java	`46.58% <20.00%> (-4.92%)`	⬇️
...java/org/apache/pinot/core/common/DataFetcher.java	`77.81% <42.30%> (-11.93%)`	⬇️
...e/pinot/core/transport/InstanceRequestHandler.java	`60.15% <60.00%> (-4.69%)`	⬇️
.../java/org/apache/pinot/query/QueryEnvironment.java	`82.75% <80.00%> (-5.00%)`	⬇️
...ator/transform/function/CastTransformFunction.java	`84.00% <100.00%> (+13.35%)`	⬆️
... and 735 more

xiangfu0

lgtm

Jackie-Jiang · 2022-08-31T22:07:47Z

The test failure might be related:

2022-08-31T07:17:28.7248119Z [ERROR] Failures: 
2022-08-31T07:17:28.7249092Z [ERROR]   ParquetRecordReaderTest.testComparison:105->testComparison:125 expected [false] but found [true]

saurabhd336 · 2022-09-01T05:41:06Z

@Jackie-Jiang ACK. Fixed the test and added a new test to validate reader selection based on file metadata.

Jackie-Jiang · 2022-09-01T18:06:08Z

Can you please modify the PR description to include the new config key for the record reader config? Also update the Pinot doc where applicable

KKcorps reviewed Aug 30, 2022

xiangfu0 reviewed Aug 30, 2022

Saurabh Dubey added 2 commits August 30, 2022 14:15

Infer parquet reader type based on file metadata

874ab7f

Review comments

cae4e5e

saurabhd336 force-pushed the parquetReaderConf branch from 0191412 to cae4e5e Compare August 30, 2022 08:45

xiangfu0 approved these changes Aug 31, 2022

Add test

711add8

Jackie-Jiang added the enhancement and Configuration labels Sep 1, 2022

Jackie-Jiang merged commit d1bad1d into apache:master Sep 1, 2022

Jackie-Jiang changed the title ~~Infer parquet reader type based on file metadata (wip)~~ Infer parquet reader type based on file metadata Sep 1, 2022

Jackie-Jiang merged 3 commits into apache:master from saurabhd336:parquetReaderConf
