Minor: Document parquet_metadata function (#8852)
* Document parquet_metadata function

Co-authored-by: comphead <comphead@users.noreply.github.com>

alamb and comphead authored Jan 14, 2024
1 parent a461c33 commit 1dcdcd4
Showing 1 changed file with 68 additions and 1 deletion.
69 changes: 68 additions & 1 deletion docs/source/user-guide/cli.md
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
2 rows in set. Query took 0.007 seconds.
```
## Creating external tables
## Creating External Tables
It is also possible to create a table backed by files explicitly
via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported
@@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are:
> \h function
```
## Supported SQL
In addition to the normal [SQL supported in DataFusion], `datafusion-cli`
supports the following statements and commands:
[sql supported in datafusion]: sql/index.rst
- Show configuration options
`SHOW ALL [VERBOSE]`
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
> SET datafusion.execution.batch_size to 1024;
```
- `parquet_metadata` table function
The `parquet_metadata` table function can be used to inspect detailed metadata
about a Parquet file, such as statistics, sizes, and other information. This can
be helpful for understanding how Parquet files are structured.
For example, to see information about the `"WatchID"` column in the
`hits.parquet` file, you can use:
```sql
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```
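Because the function returns an ordinary table, its output can be aggregated like any other query result. As a minimal sketch, assuming the same `hits.parquet` file, the following query summarizes each row group using only columns shown in the example above. The aliases `num_rows` and `compressed_bytes` are illustrative names, not part of the function's output:

```sql
-- One output row per row group.
-- row_group_num_rows is repeated on every column chunk of a row group,
-- so MIN() simply picks that single value.
SELECT row_group_id,
       MIN(row_group_num_rows)    AS num_rows,
       SUM(total_compressed_size) AS compressed_bytes
FROM parquet_metadata('hits.parquet')
GROUP BY row_group_id
ORDER BY row_group_id;
```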
The returned table has one row for each column chunk in the file, with the
columns listed below. Please refer to the [Parquet Documentation] for more information.
[parquet documentation]: https://parquet.apache.org/
| column_name | data_type | Description |
| ----------------------- | --------- | --------------------------------------------------------------------------------------------------- |
| filename | Utf8 | Name of the file |
| row_group_id | Int64 | Row group index the column chunk belongs to |
| row_group_num_rows | Int64 | Count of rows stored in the row group |
| row_group_num_columns | Int64 | Total number of columns in the row group (same for all row groups) |
| row_group_bytes | Int64 | Number of bytes used to store the row group (not including metadata) |
| column_id | Int64 | ID of the column |
| file_offset | Int64 | Offset within the file at which this column chunk's data begins |
| num_values | Int64 | Total number of values in this column chunk |
| path_in_schema | Utf8 | "Path" (column name) of the column chunk in the schema |
| type | Utf8 | Parquet data type of the column chunk |
| stats_min | Utf8 | The minimum value for this column chunk, if stored in the statistics, cast to a string |
| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string |
| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics |
| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics |
| stats_min_value | Utf8 | Same as `stats_min` |
| stats_max_value | Utf8 | Same as `stats_max` |
| compression | Utf8 | Block level compression (e.g. `SNAPPY`) used for this column chunk |
| encodings | Utf8 | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk |
| index_page_offset | Int64 | Offset in the file of the [`page index`], if any |
| dictionary_page_offset | Int64 | Offset in the file of the dictionary page, if any |
| data_page_offset | Int64 | Offset in the file of the first data page, if any |
| total_compressed_size | Int64 | Number of bytes of the column chunk's data after encoding and compression (what is stored in the file) |
| total_uncompressed_size | Int64 | Number of bytes of the column chunk's data after encoding, before compression |
[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
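The size columns can be combined to estimate how well each column compresses. The query below is a sketch rather than a definitive recipe: it assumes the same `hits.parquet` file as above, and the aliases `compressed_bytes`, `uncompressed_bytes`, and `compression_ratio` are illustrative names chosen here.

```sql
-- Per-column storage summary, summed over all row groups.
-- The CAST to DOUBLE avoids integer division when computing the ratio.
SELECT path_in_schema,
       SUM(total_compressed_size)   AS compressed_bytes,
       SUM(total_uncompressed_size) AS uncompressed_bytes,
       CAST(SUM(total_uncompressed_size) AS DOUBLE)
         / CAST(SUM(total_compressed_size) AS DOUBLE) AS compression_ratio
FROM parquet_metadata('hits.parquet')
GROUP BY path_in_schema
ORDER BY compressed_bytes DESC
LIMIT 5;
```

Sorting by `compressed_bytes` lists the columns that take up the most space in the file first.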
## Changing Configuration Options
All available configuration options can be seen using `SHOW ALL` as described above.