Add CastColumnExpr for struct-aware column casting #17773
Thanks for the suggestion! At the moment this PR just introduces the CastColumnExpr building block plus its focused unit coverage, without wiring it into the planner or schema-adapter path that SQLLogicTest exercises. Once we hook the adapter/planner up to construct CastColumnExpr, I agree that an end-to-end SLT is the right next step and I’ll plan to cover it in that integration change.
Thanks @comphead for your review.
    hash::Hash,
    sync::Arc,
};
/// A physical expression that applies [`cast_column`] to its input.
What is your long-term vision for CastExpr and CastColumnExpr?
It seems like CastExpr can be totally replaced with CastColumnExpr as CastColumnExpr is more general.
datafusion/datafusion/physical-expr/src/expressions/cast.rs
Lines 44 to 53 in 631f9ab
However, it seems there are a few places in the code (in the core repo and elsewhere) that look for CastExpr directly: https://github.com/search?q=repo%3Aapache%2Fdatafusion%20downcast_ref%3A%3A%3CCastExpr%3E&type=code
In other words, do you think it would be better to update the existing CastExpr to handle Fields rather than introduce a new CastColumnExpr?
Sounds like a good plan, thank you
BTW thank you for pushing this along. I actually think moving fields down through more of DataFusion will help many things, including logical types / Arrow Extension Type support, as described in
Which issue does this PR close?
`cast_column` helper semantics. #17760
Rationale for this change
The planner sometimes needs to rewrite the physical representation of a resolved column to a new schema/field while preserving nested-structure semantics. The existing `CastExpr` doesn't carry the input/target field metadata required to perform struct-aware casts (correct nested field ordering, nullability, and null-padding for missing children). Adding a dedicated `CastColumnExpr` lets the execution layer call into `datafusion_common::nested_struct::cast_column` and guarantees casts behave correctly for both array and scalar values.
This change is focused on execution-time casting semantics (schema-aware casts) and does not attempt to modify planner/optimizer behaviour.
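To illustrate what "struct-aware" means here, the following is a minimal, std-only sketch of the semantics described above: children are matched by name, emitted in the target's field order, and null-padded when the source has no such child. The `Type`, `Value`, and `cast_value` names are hypothetical and exist only for this illustration; they are not the Arrow or DataFusion API, and the real `cast_column` helper works on Arrow arrays and `Field` metadata rather than on a toy enum.

```rust
// Toy, std-only model of a struct-aware cast. `Type`, `Value` and `cast_value`
// are hypothetical names for illustration, NOT the Arrow/DataFusion API.

#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int32,
    Int64,
    Struct(Vec<(String, Type)>),
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int32(i32),
    Int64(i64),
    Struct(Vec<(String, Value)>),
}

/// Cast `value` to `target`. Struct children are matched by name, emitted in
/// the target's order, and filled with `Null` when the source lacks them.
fn cast_value(value: &Value, target: &Type) -> Result<Value, String> {
    match (value, target) {
        // Nulls stay null regardless of the target type.
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int32(v), Type::Int32) => Ok(Value::Int32(*v)),
        // Widening an i32 to i64 is lossless.
        (Value::Int32(v), Type::Int64) => Ok(Value::Int64(i64::from(*v))),
        (Value::Int64(v), Type::Int64) => Ok(Value::Int64(*v)),
        (Value::Struct(children), Type::Struct(fields)) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, field_type) in fields {
                let casted = match children.iter().find(|(n, _)| n == name) {
                    // Recurse so nested structs get the same treatment.
                    Some((_, child)) => cast_value(child, field_type)?,
                    // Missing child: null-pad instead of failing.
                    None => Value::Null,
                };
                out.push((name.clone(), casted));
            }
            Ok(Value::Struct(out))
        }
        (other, ty) => Err(format!("unsupported cast from {other:?} to {ty:?}")),
    }
}

fn main() {
    // Source struct has fields [a, b]; the target requests [a, c].
    let source = Value::Struct(vec![
        ("a".to_string(), Value::Int32(1)),
        ("b".to_string(), Value::Int32(2)),
    ]);
    let target = Type::Struct(vec![
        ("a".to_string(), Type::Int64),
        ("c".to_string(), Type::Int64),
    ]);
    let casted = cast_value(&source, &target).unwrap();
    println!("{casted:?}");
}
```

In this toy model the source child `b` is dropped because the target does not request it, and `c` comes back as `Null`, mirroring the missing-child case the PR description mentions.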
What changes are included in this PR?
- Add new physical expression implementation: `datafusion/physical-expr/src/expressions/cast_column.rs`.
- Defines `CastColumnExpr` struct which contains:
  - `expr: Arc<dyn PhysicalExpr>`: child expression producing the value to be cast.
  - `input_field: FieldRef`: resolved input field metadata.
  - `target_field: FieldRef`: desired output field metadata.
  - `cast_options: CastOptions<'static>`: forwarded to `cast_column`.
- Implements `PhysicalExpr` for `CastColumnExpr`:
  - `data_type` and `nullable` reflect the `target_field`.
  - `evaluate` handles both `ColumnarValue::Array` and `ColumnarValue::Scalar` by delegating to `datafusion_common::nested_struct::cast_column`, then converting results back to `ColumnarValue`.
  - `children`, `with_new_children`, `fmt_sql`, `return_field` implemented for planner/execution compatibility.
- Manual `PartialEq` and `Hash` impls to accommodate the `Arc<dyn PhysicalExpr>` child.
- Export the new expression in `datafusion/physical-expr/src/expressions/mod.rs` (`mod cast_column;` and `pub use cast_column::CastColumnExpr;`).
- Add unit tests in `cast_column.rs` covering array and scalar cases, nested structs, missing children, null-padding and simple primitive casts.
- Add module-level docs/comments describing intent and usage.
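The need for manual `PartialEq` and `Hash` impls comes from the trait-object child: `#[derive]` cannot compare or hash through `Arc<dyn Trait>` unless the trait object itself supports it. The std-only sketch below shows one common way to do this, under the assumption of a hypothetical `Expr` trait with `dyn_eq`/`dyn_hash` hooks. `Expr`, `Col`, and `CastLike` are illustration-only names, and the field types are simplified to strings; DataFusion's actual traits and the PR's implementation may differ.

```rust
use std::any::Any;
use std::collections::hash_map::DefaultHasher;
use std::fmt::Debug;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Hypothetical stand-in for a trait-object expression (not `PhysicalExpr`).
trait Expr: Debug {
    fn as_any(&self) -> &dyn Any;
    fn dyn_eq(&self, other: &dyn Expr) -> bool;
    fn dyn_hash(&self, state: &mut dyn Hasher);
}

#[derive(Debug, PartialEq, Eq, Hash)]
struct Col {
    name: String,
}

impl Expr for Col {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn dyn_eq(&self, other: &dyn Expr) -> bool {
        // Equal only if `other` is also a `Col` with the same contents.
        other.as_any().downcast_ref::<Col>().map_or(false, |o| self == o)
    }
    fn dyn_hash(&self, mut state: &mut dyn Hasher) {
        // `&mut dyn Hasher` itself implements `Hasher`, so it can be passed on.
        self.hash(&mut state);
    }
}

// Simplified analogue of an expression holding a trait-object child.
#[derive(Debug)]
struct CastLike {
    expr: Arc<dyn Expr>,
    input_type: String,
    target_type: String,
}

impl PartialEq for CastLike {
    fn eq(&self, other: &Self) -> bool {
        self.expr.dyn_eq(other.expr.as_ref())
            && self.input_type == other.input_type
            && self.target_type == other.target_type
    }
}

impl Eq for CastLike {}

impl Hash for CastLike {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly the fields that `eq` compares, so equal values hash equally.
        self.expr.dyn_hash(&mut *state);
        self.input_type.hash(state);
        self.target_type.hash(state);
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let make = |col: &str, target: &str| CastLike {
        expr: Arc::new(Col { name: col.to_string() }),
        input_type: "Int32".to_string(),
        target_type: target.to_string(),
    };
    let a = make("x", "Int64");
    let b = make("x", "Int64");
    let c = make("x", "Utf8");
    assert_eq!(a, b);
    assert_eq!(hash_of(&a), hash_of(&b));
    assert_ne!(a, c);
}
```

The key design point is that `hash` covers the same fields as `eq`, which keeps the two impls consistent when such expressions are used as keys in hash-based collections.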
Are these changes tested?
Yes, unit tests included in `cast_column.rs` exercise array & scalar cases, nested structs, missing children, null-padding and simple primitive casts. Tests added:
- `cast_primitive_array`: cast `Int32Array` to `Int64Array`.
- `cast_struct_array_missing_child`: source struct has fields `[a, b]`, target struct requests `[a, c]` and expects `c` to be all-null.
- `cast_nested_struct_array`: nested struct casting where inner struct adds a new child field that must be null-padded.
- `cast_struct_scalar`: casting a struct literal (scalar) and preserving result as a `ScalarValue::Struct` with casted children.
All tests are Rust unit tests using `RecordBatch` and `Column`/`Literal` helpers already available in the crate.
Are there any user-facing changes?
- Public API: `CastColumnExpr` is exported from `datafusion::physical_expr::expressions` and can be constructed/used by callers who build physical expression trees. This is primarily intended for internal planner/code that needs to perform schema-aware casting of resolved columns.
- No changes to existing SQL surface or planner rules are included in this PR; it only adds a building block for use by other components (e.g. schema rewriters or planner nodes).
- No behaviour change for existing code that doesn't use `CastColumnExpr`.