Skip to content

How to build a field of type Struct with DataFusion SQL? #2043

@lquerel

Description

@lquerel

DataFusion already gives us a way to access nested fields with this syntax field["nested_field"] (field being an arrow struct field). However I didn't find any example or documentation to do the reverse operation, i.e. creating a struct from multiple fields. A pseudo SQL query (inspired from BIGQUERY) will be something like: SELECT STRUCT(a AS field_1, b AS field_2) AS struct_name FROM table. Is there a way to express this type of construct with DataFusion SQL?

Support for this type of functionality will be particularly helpful for dealing with data sources that support nested fields (e.g. parquet and JSON) and when the query output MUST also be stored in a nested form. Example of use cases:

  • Remove privacy sensitive fields from a struct field.
  • Apply some transformations on some nested fields.
  • Compute new nested fields based on the existing nested fields (e.g. add country, city nested fields from an ip_address nested field).

Proposed solution:

  • Add a new operator to create a field of type Struct. Simple syntax: STRUCT (expr AS field_name, ...) AS struct_field_name
  • Support an EXCEPT keyword to remove one or several fields from a struct. e.g. SELECT STRUCT (struct_a.* EXCEPT (field_1, field_2)) FROM xyz. This type of syntax is robust to schema evolution as new fields will be automatically captured by the wildcard operator.
  • Support wildcard operators to combine multiple structs into a single one. e.g. SELECT STRUCT(struct_a.*, struct_b.*) AS combined_struct FROM xyz. This must fail if the structures a and b have fields with the same name.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions