Skip to content

[EPIC] A collection of support for metadata columns in ListingTable #20135

@alamb

Description

@alamb

Other systems support "metadata" columns when querying datasources. These metadata columns do not exist in the underlying data source but instead are related to the source

Common examples:

  • Row number
  • File name
  • Row Group Number (parquet)

DataBricks / Spark

It appears DataBricks / spark represents this concept as a struct column _metadata column with multiple fields
https://docs.databricks.com/aws/en/ingestion/file-metadata-column

SELECT
    *
    ,_metadata
    ,_metadata.file_path
    ,_metadata.file_name
    ,_metadata.file_modification_time
FROM
    json.`/path/to/table/data`

It looks like maybe spark/databricks used to support the input_file_name() function, but has moved to _metadata: https://pawankumarshukla1979.medium.com/tips-use-metadata-instead-of-input-file-name-function-in-databricks-runtime-10-5-and-above-b32766b0296b

DuckDB

DuckDB seems to model this as additional parameters to the read_parquet function, specifically file_name and file_row_number:
https://duckdb.org/docs/stable/data/parquet/overview#parameters

Per @adriangb last year:

D select filename, sum(row_count) as row_count from read_parquet('/Users/adriangb/Downloads/data2/**/*_stats.parquet', filename=true) group by filename order by row_count desc limit 10;
Binder Error:
Option filename adds column "filename", but a column with this name is also in the file. Try setting a different name: filename='<filename column name>'

Related tickets for adding similar metadata features to DataFusion:

Metadata

Metadata

Assignees

No one assigned

    Labels

    PROPOSAL EPICA proposal being discussed that is not yet fully underway

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions