Skip to content

feat: Additional Canonical Extension Types#21291

Merged
adriangb merged 14 commits intoapache:mainfrom
tobixdev:additional-canonical-extension-types
Apr 14, 2026
Merged

feat: Additional Canonical Extension Types#21291
adriangb merged 14 commits intoapache:mainfrom
tobixdev:additional-canonical-extension-types

Conversation

@tobixdev
Copy link
Copy Markdown
Contributor

@tobixdev tobixdev commented Apr 1, 2026

Which issue does this PR close?

Rationale for this change

Implements DFExtensionType for the other canonical extension types (except Parquet types).

What changes are included in this PR?

  • Implement DFExtensionType for:
    • arrow.bool8 (custom formatting with true/false)
    • arrow.fixed_shape_tensor (nu custom formatting for now)
    • arrow.json (no custom formatting for now)
    • arrow.opaque (no custom formatting for now)
    • arrow.timestamp_with_offset (custom formatting as timestamp)
    • arrow.variable_shape_tensor (no custom formatting)
  • Introduce a wrapper struct DFUuid for Uuid so that it is consistent with the other extension types. I don't know whether we truly need the wrapper structs for extension types that only have a single valid storage type based on their metadata (e.g., Bool8, Uuid) . Open for any opinions.

@paleolimbot Is this the kind of wrapper structs you imagined?

Should we also add end-to-end tests for the other extension types? Currently, we only have the UUID one from the last PR.

I think for variant supports it's best to wait for general Variant support in DataFusion as this is currently done in https://github.com/datafusion-contrib/datafusion-variant .

Are these changes tested?

Custom formatters are tested. The rest is mainly boiler plate code.

Are there any user-facing changes?

Yes more extension types and renames Uuid to DFUuid. If we plan to keep our own wrapper structs, we should merge that before releasing a version with the old Uuid so we have no breaking changes.

@tobixdev tobixdev changed the title Additional Canonical Extension Types feat: Additional Canonical Extension Types Apr 1, 2026
@github-actions github-actions bot added logical-expr Logical plan and expressions common Related to common crate labels Apr 1, 2026
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

I may be missing something about the arrow-rs ExtensionType, but I'm not sure we need to impl it for these (which can be constrained to datafusion-specific behaviour).

I think datafusion-common is a great place for these now...there might be type-specific crates in the future that include the dependencies required to actually do things (e.g., the datafusion JSON contribution being discussed and perhaps a future variant), and when that's the case the type definition could maybe live there too.

Comment on lines +33 to +35
impl ExtensionType for DFBool8 {
const NAME: &'static str = Bool8::NAME;
type Metadata = <Bool8 as ExtensionType>::Metadata;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is needed (only the impl DFExtensionType bit). Does the fact that these objects implement ExtensionType allow something to be done with them that isn't otherwise possible?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this was also a thorn in my eye. I think the only thing is the default registration that requires ExtensionType + DFExtensionType.

I've removed this requirement from the default registration. I think it's better now, especially because now there is only a single ExtensionType implementation that can be used for reading/writing metadata. The DFBool8 etc. types can then be ignored by most users, except those that want to customize the registered extension types.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, I think the example is a bit weird as it implements both traits in the same struct.

For example, ExtensionType::supports_data_type now returns true for all supported data types (as it should for the arrow-rs API), but it may seem unexpected as it just stores a single sotrage type. So some users could assume that it should only return true for this specific type.

Do you think it warrants splitting the example into TemperatureExtensionType and DFTemperatureExtensionType too to avoid this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just remove the ExtensionType implementation there? The storage type check can move into new() and you can remove a layer of indirection.

(It is my complaint with the arrow-rs ExtensionType trait that there is pretty much zero incentive to implement it because it doesn't help anybody actually do anything).

Copy link
Copy Markdown
Contributor Author

@tobixdev tobixdev Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a good idea! That removed the weirdness I mentioned above.

This also allowed me to change the registration to not depend on ExtensionType. Therefore, the shenanigans around DefaultExtensionTypeRegistration are no longer necessary to support extension types that do not implement the arrow-rs trait.

@tobixdev
Copy link
Copy Markdown
Contributor Author

@adriangb Sorry for pinging you! It's just that I think we should merge this (or a similar PR) within this release cycle as it contains breaking changes of DFExtensionType API. After mergeing this one, I (or @paleolimbot if you're still interessted) can polish the draft PR on casting extension types.

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb Sorry for pinging you! It's just that I think we should merge this (or a similar PR) within this release cycle as it contains breaking changes of DFExtensionType API. After mergeing this one, I (or @paleolimbot if you're still interessted) can polish the draft PR on casting extension types.

I'm happy to help but what is the best use of my time here? Do you want me to review the PR (in addition to @paleolimbot to get 2 approvals? I don't see that he's approved it yet)? Or are there high level questions to chime in on?

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry for being slow to re-review and update the casting PR...my current work project is not in Rust and it's been tricky to flip my brain back and forth to this 😬 )

I think this PR represents minor but useful tweaks to the traits resulting from the tire-kicking of implementing a few more types. The key change is that a DFExtensionType is no longer strictly coupled to the arrow-rs ExtensionType (although can and does effectively reuse the existing implementations that apply to use in DataFusion). The other change is to strictly store the storage type of a DFExtensionType to the instance (it was required to construct the extension type, this just made it explicit). This makes it easier to pass around fewer objects when working with them (i.e., you can just pass around the extension type and not the extension type AND the data type everywhere).

A useful follow-up would be to map the corresponding SQL types (and I will rebase the casting PR on this and get it ready for review this week).

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me! I look forward to seeing how e.g. DFTimestampWithOffsetCol = TimestampUtcCol will work.

metadata: <Bool8 as ExtensionType>::Metadata,
) -> Result<Self> {
Ok(Self(<Bool8 as ExtensionType>::try_new(
data_type, metadata,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the supported data types be checked here as well? Maybe add to the docstring what data_type represents (it seems DFExtensionType::storage_type is hardcoded below so it's not immediately obvious why this is passed into the constructor).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point that this is not obvious. The validation happens in <Bool8 as ExtensionType>::try_new.

) -> Result<Self> {
Ok(Self {
inner: <FixedShapeTensor as ExtensionType>::try_new(data_type, metadata)?,
storage_type: data_type.clone(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say this is a bit unwieldy. I don't have a better proposal but wanted to note it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Users could implement their DFExtensionType without using the arrow-rs trait (like we did in our example) or provide a common function that can be used on both trait implementations. The problem in this case is that <FixedShapeTensor as ExtensionType>::try_new does some processing that I do not want to duplicate.

So I think users can work around that issue.

Comment on lines +52 to +58
fn storage_type(&self) -> DataType {
self.storage_type.clone()
}

fn serialize_metadata(&self) -> Option<String> {
self.inner.serialize_metadata()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where the metadata depends on the storage type?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one could think of such a type. E.g., a type that stores the serialized DataType in the metadata and then is only compatible with this exact DataType.

I am not sure how useful that is. Any concerns regarding that?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it does, self.storage_type() is available here (but I'm not aware of any extension types that duplicate information in the metadata and that would otherwise be inferrable from the storage type)


/// Defines the extension type logic for the canonical `arrow.opaque` extension type.
///
/// See [`DFExtensionType`] for information on DataFusion's extension type mechanism.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to a link to the canonical extension types (e.g. https://arrow.apache.org/docs/format/CanonicalExtensions.html#opaque) it might be nice to copy a summary of the type into here for discoverability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the extension types did not have a consice high-level description that I could copy, so I've added a sentence describing the type.

I've also linked to the arrow-rs implementation that contains information regarding the metadata and so forth.

@tobixdev
Copy link
Copy Markdown
Contributor Author

Thanks! In my head Dewey already re-reviewed the PR and we were waiting for a comitter's blessing. Seems I had remembered that wrongly. Sorry about that!

I'll implement your suggestions tomorrow and I'll look into the SQL type mapping.

Comment on lines +52 to +58
fn storage_type(&self) -> DataType {
self.storage_type.clone()
}

fn serialize_metadata(&self) -> Option<String> {
self.inner.serialize_metadata()
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it does, self.storage_type() is available here (but I'm not aware of any extension types that duplicate information in the metadata and that would otherwise be inferrable from the storage type)

@adriangb
Copy link
Copy Markdown
Contributor

@tobixdev good to merge?

@tobixdev
Copy link
Copy Markdown
Contributor Author

@adriangb Good to merge from my side!

@adriangb adriangb added this pull request to the merge queue Apr 14, 2026
@adriangb
Copy link
Copy Markdown
Contributor

Thanks folks! Looking forward to seeing the cast work! One other thing that comes to mind (more of an arrow-rs request): can we customize how data is encoded to JSON / CSV? E.g. for uuid it would make sense to encode it as a hex string / same as the pretty print, not as the raw bytes in base64 or something.

@paleolimbot
Copy link
Copy Markdown
Member

can we customize how data is encoded to JSON / CSV?

For CSV this is doable after the cast work (cast to string!). JSON is probably better special-cased so that the output can be written incrementally if the underlying framework supports this, but you could also get away with a cast to arrow.json (falling back to a cast to string).

@tobixdev
Copy link
Copy Markdown
Contributor Author

Me too! From what I can gather by quickly looking at the arrow-rs we would need a way of customizing the CSV writer. As it already seems to use ArrayFormatter, maybe something like a HashMap<FieldName, ArrayFormatter> could also address that.

Similar to apache/arrow-rs#8829 for the pretty-printing changes.

You beat me to an answer ;) I'll just leave it here.

Merged via the queue into apache:main with commit 3763ad4 Apr 14, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants