Suggestions and problems about ArrowReaderBuilder (or ParquetRecordBatchStreamBuilder)
#4674
Replies: 3 comments 4 replies
-
ArrowReaderBuilder reads and provides access to the ParquetMetaData, including the page index if you enable it? I would recommend checking out DataFusion's ParquetExec, which shows how these APIs can be used.
I'm not sure why you got this impression, but it is not true. If you provide a RowSelection, derived from the page index or otherwise, it will use this to elide IO and decode.
Note: I do hope to provide better APIs for interacting with the parquet statistics in the future (#4328), but I've not had sufficient bandwidth lately.
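To illustrate how a row selection can be derived from page-level statistics, here is a minimal sketch using only the standard library. `PageStats`, `Selector`, and `select_pages` are hypothetical stand-ins written for this example, not types or functions from the parquet crate; in the crate the corresponding concepts are `RowSelection` and `RowSelector`, and the selection is handed to the builder via `with_row_selection`. The sketch assumes a single `i64` column and a simple range predicate.

```rust
/// Hypothetical stand-in for one page's entry in the page index:
/// the number of rows in the page and the min/max of one i64 column.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    num_rows: usize,
    min: i64,
    max: i64,
}

/// Hypothetical stand-in for one run of a row selection:
/// either skip or select `row_count` consecutive rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Build run-length selectors for the predicate `lo <= value <= hi`.
/// A page is skipped when its [min, max] range cannot overlap [lo, hi];
/// adjacent pages with the same decision are merged into one run.
fn select_pages(pages: &[PageStats], lo: i64, hi: i64) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for p in pages {
        if p.num_rows == 0 {
            continue; // empty pages contribute no rows
        }
        let skip = p.max < lo || p.min > hi;
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                // Same decision as the previous run: extend it.
                last.row_count += p.num_rows;
                continue;
            }
        }
        out.push(Selector { row_count: p.num_rows, skip });
    }
    out
}
```

Skipped runs are the ones a reader does not need to fetch or decode, which is where the IO savings would come from.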
-
For my points 2 and 3, I have no questions now.
-
1. Make `new_builder` public for more flexible operations.

   It's more flexible to allow users to pass `ParquetMetaData` manually. For example:
   - If we want to analyze `ParquetMetaData` first (for collecting stats, pruning row groups...), we can pass this `ParquetMetaData` to build a reader directly and avoid reading it twice.
   - If we want to prune row groups, we need to call `with_row_groups` on `ArrowReaderBuilder`. But we can only know which row groups to prune after we have read the parquet metadata.

2. `ArrowReaderOptions` contains `page_index` but `ArrowReaderBuilder` doesn't use it.

   After reading the code, I found that neither the sync nor the async `ParquetRecordBatchReader` can use the page index to optimize IO.

3. The sync and async `ArrowReader`s have different read options, and the APIs are quite confusing.

   If we create a reader with `ArrowReaderBuilder`, we pass `ArrowReaderOptions` to it. However, if we want to create a sync reader, `ArrowReaderOptions` is converted to `ReadOptions` (https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L172).

   I think there are some problems:
   - Only `page_index` from `ArrowReaderOptions` is used to construct `ReadOptions`. Other options like `ReadGroupPredicate` do not exist, so we cannot prune row groups by passing predicates if we create a reader with `ArrowReaderBuilder`.
   - The async reader does not use `ReadOptions`, which may cause the async reader to miss some optimizations.

I think we should unify them and expose more reasonable APIs.
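The row-group pruning use case from the first suggestion can be sketched roughly as follows. This is a standard-library-only model: `RowGroupStats` and `prune_row_groups` are hypothetical names invented for illustration and are not part of the parquet crate. It assumes the metadata has already been read once and that each row group exposes min/max statistics for a single `i64` column; the resulting indices are the kind of `Vec<usize>` that `with_row_groups` accepts.

```rust
/// Hypothetical stand-in for the statistics of one row group,
/// as they might be extracted from already-loaded file metadata.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    num_rows: usize,
    min: i64,
    max: i64,
}

/// Return the indices of row groups that may contain a value in [lo, hi].
/// Row groups whose [min, max] range cannot overlap the predicate range,
/// or that contain no rows, are pruned.
fn prune_row_groups(groups: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.num_rows > 0 && g.max >= lo && g.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

Because the pruning decision needs the metadata, being able to reuse the same metadata when constructing the reader is what would avoid reading the footer twice.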