feat: Replace DataType with FieldRef in Cast#18120
feat: Replace DataType with FieldRef in Cast#18120Weijun-H wants to merge 7 commits intoapache:mainfrom
DataType with FieldRef in Cast#18120Conversation
1598f94 to
2a06d44
Compare
DataType with FieldRef in Cast
40f0012 to
bd29587
Compare
|
Apologies...I missed this PR! Feel free to lift anything useful from #18136 (I'm just as happy to review this instead). |
6563998 to
7b32f01
Compare
|
(I did clean up #18136 because I had some time on Friday but also feel free to keep going here and I'm happy to review!) |
Hi @paleolimbot , this pr is ready for review |
DataType with FieldRef in CastDataType with FieldRef in Cast
paleolimbot
left a comment
There was a problem hiding this comment.
Nice! I have a few high-level concerns but am happy to iterate. Another option is to pursue #18136 and implement the nullability handling here (which I didn't implement) as a follow-on!
- The
DataTypeExt::into_nullable_field_ref()I think will help eliminate the need for theArc::new(Field::new("", ...))changes. - I think that the
cast()andCast::new()signatures should probably remain unchanged to minimize disruption in this PR and downstream - Considering the nullability of the cast field in the logical expression is problematic (highighted by the CI failures here) because all existing casts simply propagate the input expression nullability. I think this is what most people mean by a cast and so we should keep that behaviour. If we were to express a cast with explicit nullability we would have to do something like
Option<bool>(where None means propagate the input, true means explicitly cast to nullable, and false means explicitly cast to non-nullable). All of those would require integration with the physical expression implementing the cast to ensure output schema consistency (which is the reason for the CI failure I believe)
| Expr::Cast(Cast { expr, data_type }) => { | ||
| write!(f, "CAST({expr} AS {data_type})") | ||
| Expr::Cast(Cast { expr, field }) => { | ||
| write!(f, "CAST({expr} AS {})", field.data_type()) |
There was a problem hiding this comment.
There's now a format_type_and_metadata() helper in datafusion_common that will ensure that casts with metadata are displayed. That display is a bit verbose at the moment when metadata exists but it makes it quite a bit easier to track down bugs where the metadata was dropped.
| let expr = Expr::Cast(Cast { | ||
| expr: Box::new(Expr::Literal(ScalarValue::Float32(Some(1.23)), None)), | ||
| data_type: DataType::Utf8, | ||
| field: Arc::new(Field::new("cast", DataType::Utf8, false)), |
There was a problem hiding this comment.
| field: Arc::new(Field::new("cast", DataType::Utf8, false)), | |
| field: DataType::Utf8.into_nullable_field_ref(), |
@alamb kindly added a helper for this (there are many places in this PR that would benefit!)
An aside is that I think we should ensure that these fields are always unnamed for consistency (and that the names of such a field are never used).
| /// Types that can be used to describe the result of a cast expression. | ||
| pub trait IntoCastField { | ||
| fn into_cast_field(self, expr: &Expr) -> FieldRef; | ||
| } |
There was a problem hiding this comment.
These are somewhat similar to the FieldExt traits that Andrew added to datafusion_common/src/datatypes.rs. I think DataFusion would benefit if it had a way to say "get me a FieldRef that is actually a data type", but I am also not sure this PR is the right place to do that.
| pub fn cast(expr: Expr, data_type: DataType) -> Expr { | ||
| Expr::Cast(Cast::new(Box::new(expr), data_type)) | ||
| pub fn cast<F>(expr: Expr, field: F) -> Expr | ||
| where | ||
| F: IntoCastField, | ||
| { | ||
| let field = field.into_cast_field(&expr); | ||
| Expr::Cast(Cast::new(Box::new(expr), field)) |
There was a problem hiding this comment.
I see that a number of references to cast() were updated as cast(Arc::new(Field::new(...)))...I wonder if those can be reverted (in theory IntoCastField should handle the conversion?)
| let (_, input_field) = expr.to_field(schema)?; | ||
| let mut combined_metadata = FieldMetadata::from(input_field.metadata()); | ||
| combined_metadata.extend(FieldMetadata::from(field.metadata())); | ||
| let field = combined_metadata.add_to_field(field.as_ref().clone()); | ||
| Ok(Arc::new(field)) |
There was a problem hiding this comment.
I would prefer not to combine metadata here...this would mean that a Cast for an extension type would basically be a "reinterpret cast" not a "logical cast", which is what I believe the Cast is trying to represent in a logical plan. The Alias can currently add metadata in this way which I think is a bit fishy but is at least not attempting to represent a logical operation.
There was a problem hiding this comment.
I think many of these changes should be reverted (I don't think this file is generated?)
| datafusion_common.ArrowType arrow_type = 2; | ||
| datafusion_common.Field field = 2; |
There was a problem hiding this comment.
It is also possible to add metadata + nullability here to keep the protobuf backward compatible (not a requirement but is relatively easy to do)
| pub fn new(expr: Box<Expr>, data_type: DataType) -> Self { | ||
| Self { expr, data_type } | ||
| pub fn new(expr: Box<Expr>, field: FieldRef) -> Self { | ||
| Self { expr, field } |
There was a problem hiding this comment.
This change (and the corresponding change for TryCast::new() seem to have had caused significant disruption...it is probably worth keeping the original signature and adding a new function specifically for casts with metadata and/or nullability to minimize the disruption in DataFusion and downstream.
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?