-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metadata_size_hint for optimistic fetching of parquet metadata #2946
Conversation
Build failure seems to be infra related: |
I have filed #2947 to track the CI failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @thinkharderdev -- this looks great.
I think we should handle the case when size_hint
is greater than the file_size (but perhaps I am misreading the code) before merging this but otherwise 👍
cc @tustvold
@@ -298,13 +327,20 @@ pub(crate) async fn fetch_parquet_metadata( | |||
))); | |||
} | |||
|
|||
let footer_start = meta.size - 8; | |||
let footer_start = if let Some(size_hint) = size_hint { | |||
meta.size - size_hint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if size_hit is larger than the file size?
Maybe we could used saturating_sub
here instead: https://doc.rust-lang.org/std/primitive.usize.html#method.saturating_sub
assert_eq!(c1_stats.null_count, Some(1)); | ||
assert_eq!(c2_stats.null_count, Some(3)); | ||
|
||
// Use the file size as the hint so we can get the full metadata from the first fetch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any way to test that there was a single request made to the object store as well 🤔
@@ -567,6 +575,90 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[derive(Debug)] | |||
struct RequestCountingObjectStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
I think if you rebase from master to pick up #2948 your CI should pass cleanly |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
… request counts in unit test
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
994a587
to
d027f16
Compare
Codecov Report
@@ Coverage Diff @@
## master #2946 +/- ##
==========================================
- Coverage 85.43% 85.42% -0.01%
==========================================
Files 275 275
Lines 49739 49839 +100
==========================================
+ Hits 42495 42576 +81
- Misses 7244 7263 +19
|
Benchmark runs are scheduled for baseline = b49093c and contender = 834924f. 834924f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Related to apache/arrow-rs#2110
Rationale for this change
Add a configurable size hint to
ParquetFormat
to allow fetching metadata without multiple seeks. This is not set by default. If it is configured, theParquetFileReader
will readmetadata_size_hint
bytes from the end of the parquet file and then fetch additional data only if the metadata is more thanmetadata_size_hint
What changes are included in this PR?
Are there any user-facing changes?
Both
ParquetFormat
andParquetExec
can not take an optionalmetadata_size_hint
.