Minor: Document `parquet_metadata` function #8852
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
 2 rows in set. Query took 0.007 seconds.
 ```

-## Creating external tables
+## Creating External Tables
drive by cleanup -- the other headings are capitalized so it seemed strange that this one was not
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
 > SET datafusion.execution.batch_size to 1024;
 ```

+- `parquet_metadata` table function
| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string |
| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics |
| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics |
| stats_min_value | Utf8 | Same as `stats_min` |
Wondering whether these duplicated fields are needed?
I don't honestly know why the seemingly duplicated columns are present. It was done initially to mirror duckdb which has them. Maybe we should investigate the reason why 🤔
D create table foo as select * from parquet_metadata('./benchmarks/data/hits.parquet');
D describe table foo;
┌─────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│       column_name       │ column_type │  null   │   key   │ default │  extra  │
│         varchar         │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ file_name               │ VARCHAR     │ YES     │         │         │         │
│ row_group_id            │ BIGINT      │ YES     │         │         │         │
│ row_group_num_rows      │ BIGINT      │ YES     │         │         │         │
│ row_group_num_columns   │ BIGINT      │ YES     │         │         │         │
│ row_group_bytes         │ BIGINT      │ YES     │         │         │         │
│ column_id               │ BIGINT      │ YES     │         │         │         │
│ file_offset             │ BIGINT      │ YES     │         │         │         │
│ num_values              │ BIGINT      │ YES     │         │         │         │
│ path_in_schema          │ VARCHAR     │ YES     │         │         │         │
│ type                    │ VARCHAR     │ YES     │         │         │         │
│ stats_min               │ VARCHAR     │ YES     │         │         │         │
│ stats_max               │ VARCHAR     │ YES     │         │         │         │
│ stats_null_count        │ BIGINT      │ YES     │         │         │         │
│ stats_distinct_count    │ BIGINT      │ YES     │         │         │         │
│ stats_min_value         │ VARCHAR     │ YES     │         │         │         │
│ stats_max_value         │ VARCHAR     │ YES     │         │         │         │
│ compression             │ VARCHAR     │ YES     │         │         │         │
│ encodings               │ VARCHAR     │ YES     │         │         │         │
│ index_page_offset       │ BIGINT      │ YES     │         │         │         │
│ dictionary_page_offset  │ BIGINT      │ YES     │         │         │         │
│ data_page_offset        │ BIGINT      │ YES     │         │         │         │
│ total_compressed_size   │ BIGINT      │ YES     │         │         │         │
│ total_uncompressed_size │ BIGINT      │ YES     │         │         │         │
├─────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 23 rows                                                             6 columns │
└───────────────────────────────────────────────────────────────────────────────┘
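One way to start that investigation is to check whether the paired columns ever differ for a given file. The query below is a minimal sketch, assuming the `hits.parquet` path from the session above; it uses only column names that appear in the `describe` output. As possible background, the Parquet format's `Statistics` struct defines deprecated `min`/`max` fields alongside newer `min_value`/`max_value` fields, which may be what the two column pairs mirror. That mapping is an assumption here, not something confirmed in this thread.

```sql
-- Count column chunks where the seemingly duplicated statistics differ.
-- IS DISTINCT FROM treats two NULLs as equal, so chunks with no
-- statistics at all are not counted as mismatches.
SELECT
  SUM(CASE WHEN stats_min IS DISTINCT FROM stats_min_value THEN 1 ELSE 0 END) AS min_mismatches,
  SUM(CASE WHEN stats_max IS DISTINCT FROM stats_max_value THEN 1 ELSE 0 END) AS max_mismatches
FROM parquet_metadata('./benchmarks/data/hits.parquet');
```

If both counts come back as zero, the pairs hold the same values for this file. That would not prove they are redundant in general, since other writers may populate the fields differently.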
Thanks @alamb, nice work. There are a couple of minor issues.
Co-authored-by: comphead <comphead@users.noreply.github.com>
lgtm, thanks @alamb
Which issue does this PR close?
Part of #7013
Rationale for this change
I was writing a blog post about DataFusion (apache/arrow-site#457) and wanted to highlight this feature, which @Veeupup added in #8367, but it wasn't documented.
What changes are included in this PR?
Document the function
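For context, here is a minimal sketch of the kind of query the new docs describe, run from the DataFusion CLI. The file name `example.parquet` is a hypothetical placeholder, and the selected columns are taken from the column list discussed in the review.

```sql
-- Inspect per-column-chunk metadata for a Parquet file.
-- 'example.parquet' is a placeholder path, not a file from this PR.
SELECT
  path_in_schema,
  row_group_id,
  row_group_num_rows,
  stats_min,
  stats_max,
  total_compressed_size
FROM parquet_metadata('example.parquet')
LIMIT 3;
```

Each returned row describes one column chunk within one row group, so a file with several row groups yields multiple rows per column.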
Are these changes tested?
N/A
Are there any user-facing changes?
More docs