Add an example to inspect parquet files and dump row group and page level metadata information #20117
base: branch-25.12
 * @brief Compute page row counts and page row offsets and column chunk page (count) offsets for a
 * given column index
 */
[[nodiscard]] auto compute_page_row_counts_and_offsets(
Function modified from page_index_filter.cu

cudf::host_span<uint8_t const> fetch_footer_bytes(cudf::host_span<uint8_t const> buffer)
{
  CUDF_FUNC_RANGE();
These fetch_xx functions are copied as-is from hybrid_scan_common.cpp
page_index_bytes.size());
}

std::tuple<cudf::io::parquet::FileMetaData, bool> read_parquet_metadata(
Function modified from hybrid_scan_common.cpp
return std::tuple{reader->parquet_metadata(), has_page_index};
}

void write_rowgroup_metadata(cudf::io::parquet::FileMetaData const& metadata,
Function modified from hybrid_scan_helpers.cpp and reader_impl_helpers.cpp
Description
This PR adds a new libcudf example called parquet_inspect, which extracts row group and page level metadata from the input parquet file's footer and page index (if available) and writes each level to its own new parquet file.
At the row group level, the written file contains three columns: the starting row offset, the row count, and the byte offset within the file for each row group.
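For illustration, here is a minimal host-side sketch of how the three row group columns relate to each other. It is not the code in this PR: `row_group_summary` and `summarize_row_groups` are hypothetical names, and the per-row-group row counts and byte offsets are assumed to have already been read from the footer. The starting row offset of each row group is the exclusive prefix sum of the row counts.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical container for the three row group level columns (not a libcudf type).
struct row_group_summary {
  std::vector<int64_t> row_offsets;   // starting row offset of each row group
  std::vector<int64_t> row_counts;    // number of rows in each row group
  std::vector<int64_t> byte_offsets;  // byte offset of each row group within the file
};

// Assumes row_counts and byte_offsets were already extracted from the footer,
// one entry per row group.
row_group_summary summarize_row_groups(std::vector<int64_t> const& row_counts,
                                       std::vector<int64_t> const& byte_offsets)
{
  assert(row_counts.size() == byte_offsets.size());
  std::vector<int64_t> row_offsets(row_counts.size());
  // Exclusive scan: offset[i] = sum of row_counts[0..i-1], so offset[0] = 0.
  std::exclusive_scan(row_counts.begin(), row_counts.end(), row_offsets.begin(), int64_t{0});
  return {std::move(row_offsets), row_counts, byte_offsets};
}
```

With row counts of 100, 50, and 25, the sketch yields starting row offsets of 0, 100, and 150.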
At the page level, the written file contains three columns for each column in the input file: lists of page-level row offsets, lists of page-level row counts, and lists of page-level byte offsets within the file, with one list per row group.
Note that the page-level metadata is only extracted and written if the page index is available in the parquet file.
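As a rough sketch of why the page index is needed here: in the Parquet format, the offset index records a first row index and a byte offset for every page, so page row counts can be derived by differencing consecutive first row indices within a row group. The helper below is hypothetical (the name `page_row_counts` is not from this PR or from libcudf) and assumes the first row indices for one column chunk have already been decoded from the page index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: derive per-page row counts for one column chunk.
// first_row_indices[i] is the index, within the row group, of the first row in page i
// (assumed to come from the offset index). row_group_num_rows is the row group's row count.
std::vector<int64_t> page_row_counts(std::vector<int64_t> const& first_row_indices,
                                     int64_t row_group_num_rows)
{
  auto const num_pages = first_row_indices.size();
  std::vector<int64_t> counts(num_pages);
  for (std::size_t i = 0; i < num_pages; ++i) {
    // The last page ends where the row group ends; every other page ends where the next begins.
    auto const next = (i + 1 < num_pages) ? first_row_indices[i + 1] : row_group_num_rows;
    assert(next >= first_row_indices[i]);
    counts[i] = next - first_row_indices[i];
  }
  return counts;
}
```

For a 100-row row group whose pages start at rows 0, 40, and 90, this gives page row counts of 40, 50, and 10. Without a page index there are no first row indices to difference, which is consistent with the page-level output being skipped in that case.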
Checklist