Minor: Document parquet_metadata function #8852

Merged
merged 2 commits into from
Jan 14, 2024

@alamb alamb commented Jan 13, 2024

Which issue does this PR close?

Part of #7013

Rationale for this change

I was writing a blog post about DataFusion (apache/arrow-site#457) and wanted to highlight this feature, which @Veeupup added in #8367, but it wasn't documented.

What changes are included in this PR?

Document the function

Are these changes tested?

N/A

Are there any user-facing changes?

More docs

@alamb alamb added the documentation Improvements or additions to documentation label Jan 13, 2024
@alamb alamb marked this pull request as ready for review January 13, 2024 12:57
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jan 13, 2024
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
2 rows in set. Query took 0.007 seconds.
```

-## Creating external tables
+## Creating External Tables
Contributor Author

drive by cleanup -- the other headings are capitalized so it seemed strange that this one was not

@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
> SET datafusion.execution.batch_size to 1024;
```

- `parquet_metadata` table function
Contributor Author

Here is what this looks like rendered:
[Screenshot 2024-01-13 at 7 58 34 AM]

| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string |
| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics |
| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics |
| stats_min_value | Utf8 | Same as `stats_min` |
Contributor

Wondering whether these duplicated fields are needed?

Contributor Author

I honestly don't know why the seemingly duplicated columns are present. They were added initially to mirror DuckDB, which has them. Maybe we should investigate the reason why 🤔

D create table foo as select * from parquet_metadata('./benchmarks/data/hits.parquet');
D describe table foo;
┌─────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│       column_name       │ column_type │  null   │   key   │ default │  extra  │
│         varchar         │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ file_name               │ VARCHAR     │ YES     │         │         │         │
│ row_group_id            │ BIGINT      │ YES     │         │         │         │
│ row_group_num_rows      │ BIGINT      │ YES     │         │         │         │
│ row_group_num_columns   │ BIGINT      │ YES     │         │         │         │
│ row_group_bytes         │ BIGINT      │ YES     │         │         │         │
│ column_id               │ BIGINT      │ YES     │         │         │         │
│ file_offset             │ BIGINT      │ YES     │         │         │         │
│ num_values              │ BIGINT      │ YES     │         │         │         │
│ path_in_schema          │ VARCHAR     │ YES     │         │         │         │
│ type                    │ VARCHAR     │ YES     │         │         │         │
│ stats_min               │ VARCHAR     │ YES     │         │         │         │
│ stats_max               │ VARCHAR     │ YES     │         │         │         │
│ stats_null_count        │ BIGINT      │ YES     │         │         │         │
│ stats_distinct_count    │ BIGINT      │ YES     │         │         │         │
│ stats_min_value         │ VARCHAR     │ YES     │         │         │         │
│ stats_max_value         │ VARCHAR     │ YES     │         │         │         │
│ compression             │ VARCHAR     │ YES     │         │         │         │
│ encodings               │ VARCHAR     │ YES     │         │         │         │
│ index_page_offset       │ BIGINT      │ YES     │         │         │         │
│ dictionary_page_offset  │ BIGINT      │ YES     │         │         │         │
│ data_page_offset        │ BIGINT      │ YES     │         │         │         │
│ total_compressed_size   │ BIGINT      │ YES     │         │         │         │
│ total_uncompressed_size │ BIGINT      │ YES     │         │         │         │
├─────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 23 rows                                                             6 columns │
└───────────────────────────────────────────────────────────────────────────────┘
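
One way to look into the duplicated columns would be to count the column chunks where the paired values differ. The query below is only a sketch: it reuses the file path from the example above, assumes `IS DISTINCT FROM` is available in the engine being used, and was not run as part of this PR.

```sql
-- Count column chunks where the "duplicate" statistics columns disagree.
-- IS DISTINCT FROM treats NULLs as comparable, so rows where both sides
-- are NULL are not counted as differing.
SELECT count(*) AS differing_chunks
FROM parquet_metadata('./benchmarks/data/hits.parquet')
WHERE stats_min IS DISTINCT FROM stats_min_value
   OR stats_max IS DISTINCT FROM stats_max_value;
```

A result of zero for a given file would only show that the columns agree for that file, not that they always do.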

@comphead comphead left a comment

Thanks @alamb, nice work. There are a couple of minor issues.

Co-authored-by: comphead <comphead@users.noreply.github.com>
@alamb alamb changed the title Minor: Document parquet_metadata function Minor: Document parquet_metadata function Jan 14, 2024
@comphead comphead left a comment

lgtm, thanks @alamb

@comphead comphead merged commit 1dcdcd4 into apache:main Jan 14, 2024
5 checks passed
@alamb alamb deleted the alamb/parquet_metadata branch January 14, 2024 19:50