feat: Additional Canonical Extension Types#21291
Conversation
datafusion/common/src/types/canonical_extensions/timestamp_with_offset.rs
Outdated
Show resolved
Hide resolved
paleolimbot
left a comment
There was a problem hiding this comment.
This looks great!
I may be missing something about the arrow-rs ExtensionType, but I'm not sure we need to impl it for these (which can be constrained to datafusion-specific behaviour).
I think datafusion-common is a great place for these now...there might be type-specific crates in the future that include the dependencies required to actually do things (e.g., the datafusion JSON contribution being discussed and perhaps a future variant), and when that's the case the type definition could maybe live there too.
| impl ExtensionType for DFBool8 { | ||
| const NAME: &'static str = Bool8::NAME; | ||
| type Metadata = <Bool8 as ExtensionType>::Metadata; |
There was a problem hiding this comment.
I'm not sure this is needed (only the impl DFExtensionType bit). Does the fact that these objects implement ExtensionType allow something to be done with them that isn't otherwise possible?
There was a problem hiding this comment.
Yep, this was also a thorn in my eye. I think the only thing is the default registration that requires ExtensionType + DFExtensionType.
I've removed this requirement from the default registration. I think it's better now, especially because now there is only a single ExtensionType implementation that can be used for reading/writing metadata. The DFBool8 etc. types can then be ignored by most users, except those that want to customize the registered extension types.
There was a problem hiding this comment.
After this change, I think the example is a bit weird as it implements both traits in the same struct.
For example, ExtensionType::supports_data_type now returns true for all supported data types (as it should for the arrow-rs API), but it may seem unexpected as it just stores a single sotrage type. So some users could assume that it should only return true for this specific type.
Do you think it warrants splitting the example into TemperatureExtensionType and DFTemperatureExtensionType too to avoid this?
There was a problem hiding this comment.
Or just remove the ExtensionType implementation there? The storage type check can move into new() and you can remove a layer of indirection.
(It is my complaint with the arrow-rs ExtensionType trait that there is pretty much zero incentive to implement it because it doesn't help anybody actually do anything).
There was a problem hiding this comment.
Also a good idea! That removed the weirdness I mentioned above.
This also allowed me to change the registration to not depend on ExtensionType. Therefore, the shenanigans around DefaultExtensionTypeRegistration are no longer necessary to support extension types that do not implement the arrow-rs trait.
|
@adriangb Sorry for pinging you! It's just that I think we should merge this (or a similar PR) within this release cycle as it contains breaking changes of |
I'm happy to help but what is the best use of my time here? Do you want me to review the PR (in addition to @paleolimbot to get 2 approvals? I don't see that he's approved it yet)? Or are there high level questions to chime in on? |
paleolimbot
left a comment
There was a problem hiding this comment.
(Sorry for being slow to re-review and update the casting PR...my current work project is not in Rust and it's been tricky to flip my brain back and forth to this 😬 )
I think this PR represents minor but useful tweaks to the traits resulting from the tire-kicking of implementing a few more types. The key change is that a DFExtensionType is no longer strictly coupled to the arrow-rs ExtensionType (although can and does effectively reuse the existing implementations that apply to use in DataFusion). The other change is to strictly store the storage type of a DFExtensionType to the instance (it was required to construct the extension type, this just made it explicit). This makes it easier to pass around fewer objects when working with them (i.e., you can just pass around the extension type and not the extension type AND the data type everywhere).
A useful follow-up would be to map the corresponding SQL types (and I will rebase the casting PR on this and get it ready for review this week).
adriangb
left a comment
There was a problem hiding this comment.
Seems reasonable to me! I look forward to seeing how e.g. DFTimestampWithOffsetCol = TimestampUtcCol will work.
| metadata: <Bool8 as ExtensionType>::Metadata, | ||
| ) -> Result<Self> { | ||
| Ok(Self(<Bool8 as ExtensionType>::try_new( | ||
| data_type, metadata, |
There was a problem hiding this comment.
Should the supported data types be checked here as well? Maybe add to the docstring what data_type represents (it seems DFExtensionType::storage_type is hardcoded below so it's not immediately obvious why this is passed into the constructor).
There was a problem hiding this comment.
Good point that this is not obvious. The validation happens in <Bool8 as ExtensionType>::try_new.
| ) -> Result<Self> { | ||
| Ok(Self { | ||
| inner: <FixedShapeTensor as ExtensionType>::try_new(data_type, metadata)?, | ||
| storage_type: data_type.clone(), |
There was a problem hiding this comment.
I will say this is a bit unwieldy. I don't have a better proposal but wanted to note it.
There was a problem hiding this comment.
I agree. Users could implement their DFExtensionType without using the arrow-rs trait (like we did in our example) or provide a common function that can be used on both trait implementations. The problem in this case is that <FixedShapeTensor as ExtensionType>::try_new does some processing that I do not want to duplicate.
So I think users can work around that issue.
| fn storage_type(&self) -> DataType { | ||
| self.storage_type.clone() | ||
| } | ||
|
|
||
| fn serialize_metadata(&self) -> Option<String> { | ||
| self.inner.serialize_metadata() | ||
| } |
There was a problem hiding this comment.
Are there cases where the metadata depends on the storage type?
There was a problem hiding this comment.
I guess one could think of such a type. E.g., a type that stores the serialized DataType in the metadata and then is only compatible with this exact DataType.
I am not sure how useful that is. Any concerns regarding that?
There was a problem hiding this comment.
If it does, self.storage_type() is available here (but I'm not aware of any extension types that duplicate information in the metadata and that would otherwise be inferrable from the storage type)
|
|
||
| /// Defines the extension type logic for the canonical `arrow.opaque` extension type. | ||
| /// | ||
| /// See [`DFExtensionType`] for information on DataFusion's extension type mechanism. |
There was a problem hiding this comment.
In addition to a link to the canonical extension types (e.g. https://arrow.apache.org/docs/format/CanonicalExtensions.html#opaque) it might be nice to copy a summary of the type into here for discoverability.
There was a problem hiding this comment.
Most of the extension types did not have a consice high-level description that I could copy, so I've added a sentence describing the type.
I've also linked to the arrow-rs implementation that contains information regarding the metadata and so forth.
|
Thanks! In my head Dewey already re-reviewed the PR and we were waiting for a comitter's blessing. Seems I had remembered that wrongly. Sorry about that! I'll implement your suggestions tomorrow and I'll look into the SQL type mapping. |
| fn storage_type(&self) -> DataType { | ||
| self.storage_type.clone() | ||
| } | ||
|
|
||
| fn serialize_metadata(&self) -> Option<String> { | ||
| self.inner.serialize_metadata() | ||
| } |
There was a problem hiding this comment.
If it does, self.storage_type() is available here (but I'm not aware of any extension types that duplicate information in the metadata and that would otherwise be inferrable from the storage type)
|
@tobixdev good to merge? |
|
@adriangb Good to merge from my side! |
|
Thanks folks! Looking forward to seeing the cast work! One other thing that comes to mind (more of an |
For CSV this is doable after the cast work (cast to string!). JSON is probably better special-cased so that the output can be written incrementally if the underlying framework supports this, but you could also get away with a cast to arrow.json (falling back to a cast to string). |
|
Me too! From what I can gather by quickly looking at the arrow-rs we would need a way of customizing the CSV writer. As it already seems to use Similar to apache/arrow-rs#8829 for the pretty-printing changes. You beat me to an answer ;) I'll just leave it here. |
Which issue does this PR close?
DFExtensionTypefor Arrow's Canonical Extension Types #21144 .Rationale for this change
Implements
DFExtensionTypefor the other canonical extension types (except Parquet types).What changes are included in this PR?
DFExtensionTypefor:arrow.bool8(custom formatting with true/false)arrow.fixed_shape_tensor(nu custom formatting for now)arrow.json(no custom formatting for now)arrow.opaque(no custom formatting for now)arrow.timestamp_with_offset(custom formatting as timestamp)arrow.variable_shape_tensor(no custom formatting)DFUuidforUuidso that it is consistent with the other extension types. I don't know whether we truly need the wrapper structs for extension types that only have a single valid storage type based on their metadata (e.g., Bool8, Uuid) . Open for any opinions.@paleolimbot Is this the kind of wrapper structs you imagined?
Should we also add end-to-end tests for the other extension types? Currently, we only have the UUID one from the last PR.
I think for variant supports it's best to wait for general Variant support in DataFusion as this is currently done in https://github.com/datafusion-contrib/datafusion-variant .
Are these changes tested?
Custom formatters are tested. The rest is mainly boiler plate code.
Are there any user-facing changes?
Yes more extension types and renames
UuidtoDFUuid. If we plan to keep our own wrapper structs, we should merge that before releasing a version with the oldUuidso we have no breaking changes.