Description
Other systems support "metadata" columns when querying data sources. These columns are not stored in the underlying data; instead, they describe the source the data was read from.
Common examples:
- Row number
- File name
- Row Group Number (parquet)
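To make the idea concrete, here is a minimal sketch in plain Python (standard library only) of a scan that attaches a file name and a per-file row number to every row. It illustrates the concept only and is not how any of the systems below implement it; the function `scan_with_metadata` and the column names `_file_name` and `_row_number` are hypothetical, and CSV stands in for Parquet to keep the example dependency-free.

```python
import csv
import tempfile
from pathlib import Path


def scan_with_metadata(paths):
    """Yield each CSV row plus two columns that are not stored in the file."""
    for path in paths:
        with open(path, newline="") as handle:
            # Row numbers restart at 0 for every file.
            for row_number, row in enumerate(csv.DictReader(handle)):
                # Hypothetical column names, chosen only for this sketch.
                yield {
                    **row,
                    "_file_name": Path(path).name,
                    "_row_number": row_number,
                }


def demo():
    """Write two small files and scan them with the metadata columns added."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "a.csv"
        second = Path(tmp) / "b.csv"
        first.write_text("id,value\n1,x\n2,y\n")
        second.write_text("id,value\n3,z\n")
        return list(scan_with_metadata([first, second]))


rows = demo()
```

Neither extra column appears in the files themselves; both are derived from where each row came from.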
Databricks / Spark
Databricks / Spark appears to represent this concept as a single struct column named _metadata with multiple fields:
https://docs.databricks.com/aws/en/ingestion/file-metadata-column
SELECT
*
,_metadata
,_metadata.file_path
,_metadata.file_name
,_metadata.file_modification_time
FROM
json.`/path/to/table/data`

It looks like Spark / Databricks used to support the input_file_name() function but has since moved to _metadata: https://pawankumarshukla1979.medium.com/tips-use-metadata-instead-of-input-file-name-function-in-databricks-runtime-10-5-and-above-b32766b0296b
DuckDB
DuckDB seems to model this as additional parameters to the read_parquet function, specifically filename and file_row_number:
https://duckdb.org/docs/stable/data/parquet/overview#parameters
D select filename, sum(row_count) as row_count from read_parquet('/Users/adriangb/Downloads/data2/**/*_stats.parquet', filename=true) group by filename order by row_count desc limit 10;
Binder Error:
Option filename adds column "filename", but a column with this name is also in the file. Try setting a different name: filename='<filename column name>'
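
The error above shows one design question any metadata column raises: what happens when the added column's name collides with a column already in the file. The following is a rough Python sketch of that check, assuming the remedy the error message suggests (let the caller pick a different name). The helper `add_filename_column` and the sample data are hypothetical and do not reflect DuckDB's actual implementation.

```python
def add_filename_column(rows, file_name, column="filename"):
    """Append the source file name to each row, refusing to shadow a real column."""
    out = []
    for row in rows:
        if column in row:
            raise ValueError(
                f'option adds column "{column}", but the file already has one; '
                "choose a different name"
            )
        out.append({**row, column: file_name})
    return out


# A file that already contains a real column called "filename".
file_rows = [{"filename": "stored-in-file", "row_count": 10}]

try:
    add_filename_column(file_rows, "part-0.parquet")
    collided = False
except ValueError:
    collided = True

# Choosing another name for the metadata column avoids the collision.
renamed = add_filename_column(file_rows, "part-0.parquet", column="source_file")
```

A struct-style column such as _metadata sidesteps most of this by reserving a single name, at the cost of that one name still being able to collide.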
Related tickets for adding similar metadata features to DataFusion:
- metadata column support #13975
- Support metadata columns (location, size, last_modified) in ListingTableProvider #15173
- Add input_file_name built-in function #6051
- Add with_virtual_columns to ParquetSource for reading virtual columns #20132
- Return the "position" of rows in parquet files after performing a query. #13261
- Support extended partition cols for listing table. #18482