-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Closed
Labels
enhancementNew feature or requestNew feature or request
Description
DataFusion already gives us a way to access nested fields with this syntax field["nested_field"] (field being an arrow struct field). However I didn't find any example or documentation to do the reverse operation, i.e. creating a struct from multiple fields. A pseudo SQL query (inspired from BIGQUERY) will be something like: SELECT STRUCT(a AS field_1, b AS field_2) AS struct_name FROM table
. Is there a way to express this type of construct with DataFusion SQL?
Support for this type of functionality will be particularly helpful for dealing with data sources that support nested fields (e.g. parquet and JSON) and when the query output MUST also be stored in a nested form. Example of use cases:
- Remove privacy sensitive fields from a struct field.
- Apply some transformations on some nested fields.
- Compute new nested fields based on the existing nested fields (e.g. add country, city nested fields from an ip_address nested field).
Proposed solution:
- Add a new operator to create a field of type Struct. Simple syntax:
STRUCT (expr AS field_name, ...) AS struct_field_name
- Support an EXCEPT keyword to remove one or several fields from a struct. e.g.
SELECT STRUCT (struct_a.* EXCEPT (field_1, field_2)) FROM xyz
. This type of syntax is robust to schema evolution as new fields will be automatically captured by the wildcard operator. - Support wildcard operators to combine multiple structs into a single one. e.g.
SELECT STRUCT(struct_a.*, struct_b.*) AS combined_struct FROM xyz
. This must fail if the structures a and b have fields with the same name.
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request