Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add substrait support for Interval types and literals #10646

Merged
merged 4 commits into from
May 26, 2024

Conversation

waynexia
Copy link
Member

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Support convert to/from three interval types and literals in substrait

Are these changes tested?

Yes. There is a new UT roundtrip_interval_literal that covers both type and literals.

Are there any user-facing changes?

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @waynexia - thank you

/// - days: `i32`
/// - nanoseconds: `i64`
///
/// See also [`ScalarValue::IntervalMonthDayNano`] for the literal definition in DataFusion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the interval implementation is changing in the next arrow I think: apache/arrow-rs#5769

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for informing, I'd like to help migrate to the new arrow version. BTW, is there any plan for next bump?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ETA is next week sometime apache/arrow-rs#5688

Maybe you can make a "pre-release" of DataFusion against the un-released version of arrow-rs (which @tustvold often does to make sure as a way to sanity check the release)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good 👍 The "pre-update" is #10765 (still WIP

@comphead comphead merged commit 6167ce9 into apache:main May 26, 2024
23 checks passed
@waynexia waynexia deleted the substrait-interval branch May 27, 2024 03:21
@@ -1160,6 +1162,24 @@ pub(crate) fn from_substrait_type(dt: &substrait::proto::Type) -> Result<DataTyp
"Unsupported Substrait type variation {v} of type {s_kind:?}"
),
},
r#type::Kind::UserDefined(u) => {
match u.type_reference {
INTERVAL_YEAR_MONTH_TYPE_REF => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be totally wrong, but are you sure this is how type_reference is supposed to work? I'd kind of expect the type_reference to map to an extension / MappingType::ExtensionType, kinda like function_reference.

Copy link
Contributor

@alamb alamb May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not at all sure how this is supposed to work -- when I was reviewing this PR I think I concluded it followed the existing patterns.

If you have additional information / knowledge that would help improve things I think we would welcome that help

Copy link
Member Author

@waynexia waynexia May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two "references":

https://github.com/apache/datafusion/pull/10646/files#diff-d1c5f4c37ac8286d2045acb61bee17382179469557132eb02844413b260ae41bR1440-R1441

To my understanding, type variation (like these) are for different types from one base type. And type references (like these) are for different base types, those user-defined types.

I'll submit a patch to add document for the usage tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, actually I think both references (type_variation_reference and type_reference) have the same problem - I hadn't realized it affects the type_variation_reference too. Now, if each system defines the type/variation references as consts, then plans will look compatible, but may have different meaning (nothing tells another Substrait producer/consumer that type_variation_reference = 1 on a list means a large list, for example).

I believe both should be defined as SimpleExtension files, following the schema (like here https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml) rather than as constants (kinda what

//! we make use of the [simple extensions](https://substrait.io/extensions/#simple-extensions) of substrait
already says 😅). And then (I think) in the plan itself, there'd be instances of the SimpleExtensionDeclaration messages (https://github.com/substrait-io/substrait/blob/4f5b4ac4d473c9f03f30f86eca080073d99ef1c7/proto/substrait/extensions/extensions.proto#L39), and the type_reference and type_variation_reference would link to the type_anchor and type_variation_anchor, rather than to the hard-coded constants.

That way one can (in theory, at least) teach another Substrait producer/consumer about the DataFusion extensions and keep plans compatible, or at least the systems will recognize that the plans are incompatible as it refers to extension URIs that the other system doesn't know about.

Does that make sense? I'm not 100% sure of anything I'm saying here, so I may as be understanding something wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe both should be defined as SimpleExtension files

Yes, that's right. The last (I'm aware of, at least) big piece we are missing in substrait is those *.yaml spec for all extended things and related URL settings. At present, all the things are defined in the document.

From the substrait website, we'll need a yaml parsing component to support extensions from other systems as well, if we are going to implement the ability to consume plans from external systems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to file a tracking issue for tasks related to substrait support

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated at #8149

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool, so it was the plan all along :) Sg!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document PR: #10719

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…0646)

* feat: support interval types

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* impl literals

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix deadlink in doc

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants