diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md index 525ab090ce51..95b3e7125cd4 100644 --- a/docs/source/user-guide/cli.md +++ b/docs/source/user-guide/cli.md @@ -191,7 +191,7 @@ DataFusion CLI v16.0.0 2 rows in set. Query took 0.007 seconds. ``` -## Creating external tables +## Creating External Tables It is also possible to create a table backed by files by explicitly via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported @@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are: > \h function ``` +## Supported SQL + +In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also +supports additional statements and commands: + +[sql supported in datafusion]: sql/index.rst + - Show configuration options `SHOW ALL [VERBOSE]` @@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are: > SET datafusion.execution.batch_size to 1024; ``` +- `parquet_metadata` table function + +The `parquet_metadata` table function can be used to inspect detailed metadata +about a parquet file such as statistics, sizes, and other information. This can +be helpful to understand how parquet files are structured. + +For example, to see information about the `"WatchID"` column in the +`hits.parquet` file, you can use: + +```sql +SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size +FROM parquet_metadata('hits.parquet') +WHERE path_in_schema = '"WatchID"' +LIMIT 3; + ++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ +| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size | ++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ +| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 | +| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 | +| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 | ++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ +3 rows in set. Query took 0.053 seconds. +``` + +The returned table has the following columns for each row for each column chunk +in the file. Please refer to the [Parquet Documentation] for more information. + +[parquet documentation]: https://parquet.apache.org/ + +| column_name | data_type | Description | +| ----------------------- | --------- | --------------------------------------------------------------------------------------------------- | +| filename | Utf8 | Name of the file | +| row_group_id | Int64 | Row group index the column chunk belongs to | +| row_group_num_rows | Int64 | Count of rows stored in the row group | +| row_group_num_columns | Int64 | Total number of columns in the row group (same for all row groups) | +| row_group_bytes | Int64 | Number of bytes used to store the row group (not including metadata) | +| column_id | Int64 | ID of the column | +| file_offset | Int64 | Offset within the file that this column chunk's data begins | +| num_values | Int64 | Total number of values in this column chunk | +| path_in_schema | Utf8 | "Path" (column name) of the column chunk in the schema | +| type | Utf8 | Parquet data type of the column chunk | +| stats_min | Utf8 | The minimum value for this column chunk, if stored in the statistics, cast to a string | +| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string | +| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics | +| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics | +| stats_min_value | Utf8 | Same as `stats_min` | +| stats_max_value | Utf8 | Same as `stats_max` | +| compression | Utf8 | Block level compression (e.g. `SNAPPY`) used for this column chunk | +| encodings | Utf8 | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk | +| index_page_offset | Int64 | Offset in the file of the [`page index`], if any | +| dictionary_page_offset | Int64 | Offset in the file of the dictionary page, if any | +| data_page_offset | Int64 | Offset in the file of the first data page, if any | +| total_compressed_size | Int64 | Number of bytes the column chunk's data after encoding and compression (what is stored in the file) | +| total_uncompressed_size | Int64 | Number of bytes the column chunk's data after encoding | + ++-------------------------+-----------+-------------+ + +[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md + ## Changing Configuration Options All available configuration options can be seen using `SHOW ALL` as described above.