Skip to content

Conversation

@yjshen
Copy link
Member

@yjshen yjshen commented Mar 3, 2022

Which issue does this PR close?

Closes #158.

Rationale for this change

To support parallel parquet reading at row group level.

What changes are included in this PR?

One extra parameter while filtering row groups using row group metadata.

The midpoint and range comparison are from parquet-mr ParquetInputSplit semantic by selecting belonged row groups.

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 3, 2022
@yjshen yjshen changed the title Filter row groups by comparing midpoint with offset range Introduce ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range. Mar 4, 2022
@yjshen yjshen requested review from alamb and sunchao March 4, 2022 08:33
@codecov-commenter
Copy link

codecov-commenter commented Mar 4, 2022

Codecov Report

Attention: Patch coverage is 98.93617% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.12%. Comparing base (89ee9ac) to head (fcf10f9).

Files with missing lines Patch % Lines
parquet/src/file/serialized_reader.rs 98.93% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1389      +/-   ##
==========================================
+ Coverage   83.03%   83.12%   +0.09%     
==========================================
  Files         181      181              
  Lines       52936    53329     +393     
==========================================
+ Hits        43955    44332     +377     
- Misses       8981     8997      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yjshen
Copy link
Member Author

yjshen commented Mar 4, 2022

Integration test failure seems unrelated.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yjshen . Looks great to me!

self
}

/// Add a range predicate on filtering row groups if their midpoints are within the range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe indicate whether the start and end is inclusive or exclusive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is important. I've updated the doc.

/// Get midpoint offset for a row group
fn get_midpoint_offset(meta: &RowGroupMetaData) -> i64 {
let col = meta.column(0);
let mut offset = col.data_page_offset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For encrypted Parquet files we'll need to use file_offset but it's fine for now since it's not supported anyways.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb alamb added the api-change Changes to the arrow API label Mar 6, 2022
@alamb alamb changed the title Introduce ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range. Introduce ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range. Mar 6, 2022
@alamb alamb merged commit 2bca71e into apache:master Mar 6, 2022
@alamb
Copy link
Contributor

alamb commented Mar 6, 2022

Thanks @yjshen

@alamb alamb changed the title Introduce ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range. Replace filter_row_groups with ReadOptions in parquet SerializedFileReader Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet ArrayReader should allow reading a subset of row groups

6 participants