Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add least and greatest functions to functions_comparison.yml #247

Merged
merged 5 commits into from
Oct 31, 2023

Conversation

richtia
Copy link
Member

@richtia richtia commented Jul 15, 2022

PR to add functions for least and greatest.

@jvanstraten
Copy link
Contributor

I feel like string could use some more details in the description. Case sensitivity? Lexicographic vs natural? etc. Float too maybe for NaN behavior.

@westonpace
Copy link
Member

Hmm...the entire topic of "comparison" probably deserves a nice hefty block of prose somewhere (perhaps on the site itself). Otherwise we are at risk of repeating ourselves all over the YAML. The same goes for overflow, overflow vs NaN, etc.

I wonder if we want a section on the website for "functions" where some of this text can live.

@ianmcook
Copy link
Contributor

In PostgreSQL, least and greatest return null only if all the arguments evaluate to null. I think that's the behavior we want consumers to implement, so we should say something about that in the descriptions. (Because I believe in some other databases/engines, least and greatest return null if any argument is null.)

@richtia
Copy link
Member Author

richtia commented Jul 15, 2022

I feel like string could use some more details in the description. Case sensitivity? Lexicographic vs natural? etc. Float too maybe for NaN behavior.

Do you or @westonpace have a suggestion for how to handle the NaN behavior?

@jvanstraten
Copy link
Contributor

Hmm...the entire topic of "comparison" probably deserves a nice hefty block of prose somewhere (perhaps on the site itself).

Isn't this kind of behavior up to the extension that specifies the function, though? I could imagine different engines having subtly different native ways of doing comparisons, in which case conforming with Substrait's "defaults" might cost performance. You'd then want to have the option to override Substrait's defaults by just using a different function.

The alternative is associating ordering information with types instead, or at least default ordering information. SortRel actually already requires this for the default ascending/descending sorts, but beyond how nulls are to be ordered it leaves ordering up to the imagination of the user. Also, for the sort-by-function method it leaves the function signature (return type?) and the behavior for return values outside of [-1, 0, 1] unspecified.

Digression: personally I don't like having these SQL-esque default "ascending/descending" sorts at all; the implication that all types should have exactly one default ordering method seems odd to me. There is no logical way to order a 2D coordinate, for instance: you could just order by X first and then by Y, but that's as meaningless as any other sort order (Y first, by polar coordinates, by Hilbert curve, whatever). Instead, I'd much prefer having only the "custom function identifier" and "clustered" methods. If it were up to me I'd deprecate/remove the default orderings and define something like this instead:

  • By function: compare(T, T) -> i8 or less_than(T, T) -> boolean
  • By function, nulls first: compare(T!, T!) -> i8 or less_than(T!, T!) -> boolean
  • By function, nulls last: compare(T!, T!) -> i8 or less_than(T!, T!) -> boolean
  • Clustered by implicit identity function (I don't really like it, but implicit identity is used all over the place already anyway, for example in set relations)
  • Clustered using equality or comparison function: compare(T, T) -> i8 or equals(T, T) -> boolean

where by T! I mean the non-nullable version of the nullable type that's being sorted, so you only need to define ordering functions for non-nullable types without loss of generality.

If I'm not the only one who feels this way I can escalate this to an issue or PR.

Otherwise we are at risk of repeating ourselves all over the YAML.

Personally I'd much rather documentation be repeated ad nauseam than only be specified in one place where someone might not find it. It requires more maintenance, but no one wants to or can be expected to scour the complete documentation for clues when they need one specific piece of information, especially when (at present) odds are that no one has thought about it yet at all, or at least has written it down anywhere. Linking to the single point of truth would also be fine (or better) but also requires maintenance to keep the links live.

@westonpace
Copy link
Member

If I'm not the only one who feels this way I can escalate this to an issue or PR.

This pushes the burden of defining how types are sorted out of the spec and into the producers. However, the communication between producer & consumer would be very clear at that point which I believe is the point of the spec. This seems very similar to the implicit cast discussion. However, as someone working primarily on a consumer it is easy for me to say "push it all to the producer" 😆

@ianmcook @cpcloud thoughts?

Personally I'd much rather documentation be repeated ad nauseam than only be specified in one place where someone might not find it.

There should always be links/pointers to the comprehensive documentation. Yet I'd like to avoid copy/pasting entire paragraphs.

@jacques-n
Copy link
Contributor

WRT to the sorting discussion specifically, there are two options in the spec:

You choose the structured sql type sorts with asc/desc and nulls first/nulls last OR you choose a specific function to reference.

https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857

The intention I had was to formally declare a default comparison operation within the Substrait spec for known types but also allow one to use any function one wants for sorting using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this but I feel like it allows for arbitrary alternative collations, etc while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.

@richtia
Copy link
Member Author

richtia commented Jul 19, 2022

WRT to the sorting discussion specifically, there are two options in the spec:

You choose the structured sql type sorts with asc/desc and nulls first/nulls last OR you choose a specific function to reference.

https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857

The intention I had was to formally declare a default comparison operation within the Substrait spec for known types but also allow one to use any function one wants for sorting using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this but I feel like it allows for arbitrary alternative collations, etc while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.

WRT to the sorting discussion specifically, there are two options in the spec:

You choose the structured sql type sorts with asc/desc and nulls first/nulls last OR you choose a specific function to reference.

https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857

The intention I had was to formally declare a default comparison operation within the Substrait spec for known types but also allow one to use any function one wants for sorting using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this but I feel like it allows for arbitrary alternative collations, etc while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.

Do you have a suggestion of how to handle the sorting in the yaml spec. For example, if the default were lexicographical, how would I specify a natural sorting option?

Maybe for this PR we could also just try to get in what the function signatures look like and we can document/follow up on the sorting expectations for types via a github issue.

@westonpace
Copy link
Member

Do you have a suggestion of how to handle the sorting in the yaml spec. For example, if the default were lexicographical, how would I specify a natural sorting option?

This is unclear to me as well. How would a custom compare function be provided to a scalar function like the ones specified here (or lt or gt)?

Maybe for this PR we could also just try to get in what the function signatures look like and we can document/follow up on the sorting expectations for types via a github issue.

A follow-up issue seems reasonable to me given we already have functions like lt or gt that rely on the comparison of their inputs.

jvanstraten
jvanstraten previously approved these changes Jul 27, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as is considering the description states how strings are to be ordered, despite the discussion surrounding ordering. I could see the string versions be superseded by something more generic at some point, though.

@richtia richtia force-pushed the add_comparison_functions branch 2 times, most recently from dc2616b to e15c063 Compare July 27, 2022 16:56
@richtia
Copy link
Member Author

richtia commented Jul 29, 2022

Needed to fix the commit messages to pass linting check

@richtia richtia requested a review from jvanstraten July 29, 2022 02:30
- value: "string"
variadic:
min: 1
return: "string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the rest of the types here?

I would add everything except:

  • interval_year
  • interval_day
  • struct
  • list
  • map

Copy link
Member Author

@richtia richtia Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added all the other ones, except for the ones you listed and boolean.

edit: I also added uuid, but not sure how much sense that one makes? Let me know if that one should be removed.

Copy link
Contributor

@jvanstraten jvanstraten Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud Why don't you consider intervals to be comparable? Substrait effectively defines them as a number of months and a number of microseconds respectively, so I don't see why they're special. I would personally argue that only UUIDs and maps make little sense to order, though it's perfectly possible to define an ordering for them (for maps because they are basically just defined to behave like list<struct<K, V>>, and both of those can be compared using tie).

IMO this should just be least(T) -> T/greatest(T) -> T. Likewise for all normal comparison functions. If an engine doesn't consider a type comparable*, they will already have had to solve this problem and return suitable errors for sort relations with sort keys that have no default comparison operation and no custom comparator function.

* and we should just define how each builtin type is to be compared in the absence of a custom comparator function somewhere in the spec, so this isn't for to the engine to decide.

Copy link
Contributor

@cpcloud cpcloud Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't consider intervals to be comparable because of time zone shenanigans.

Consider two intervals: 1 day and 24 hours. How do these compare?

Here's how I view this:

  • ==: only possible to implement if you know the two timestamps (with time zones) that produced the interval (if that was even how it was produced!), because a timezone change across the boundary would potentially make these two unequal. An interval whose fields are exactly equivalent should compare equal.
  • !=: Similar to ==, time zones make implementing this in any kind of "obvious" way approximately impossible
  • </>: ordering has similar problems to = and !=

Let's say we have a timestamp of 2022-01-01 12:00:00 and tomorrow we switch to DST (bear with me for the sake of example).

Looking at the result of adding the above two intervals to that timestamp:

  • 2022-01-01 12:00:00 + 1 day gives 2022-01-02 13:00:00
  • 2022-01-01 12:00:00 + 24 hours gives 2022-01-02 12:00:00

Duration-based intervals I think should be comparable, but comparing finer granularity than day with day or coarser seems too fraught.

I hope I'm just wrong here and there's a sane way to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of slipped through the cracks because I thought we had resolved this offline (more or less), but I must admit I never considered this example.

At some point in the past, when I clarified what the built-in types mean, I made the assertion that it should be allowed to store interval types using a single number. In other words, 1 day, 24 hours, 1440 minutes, or 86400 seconds, should all mean exactly the same thing. I figured we could do this because year-month and day-second were separate types already anyway (but, again, didn't consider this one). I did this to make the description of the types at all compatible with Arrow's types, because you could fairly easily construct an interval type that stores the components separately using structs anyway, and frankly, because (even after this example) I remain convinced that it's good enough.

For this particular example, I would argue that the result depends on whether you're adding the interval to a timestamp or timestamp_tz. timestamp has no timezone awareness, so in both cases you would get 2022-01-02 12:00:00. timestamp_tz instead represents "real" time, where DST just doesn't exist. You simply get the timestamp that occurs 24 hours/1 day later. When represented in this particular timezone that might result in 2022-01-01 12:00:00 -> 2022-01-02 13:00:00, but represented in UTC it might be 2022-01-01 11:00:00Z -> 2022-01-02 11:00:00Z if that timezone happened to be CET. Timezone shenanigans in general are captured by the conversion between timestamp and timestamp_tz, and need not exist anywhere else.

This leaves only the more fundamental problem that months are not always the same length in our calendar, which is already covered by having different types for year-month and day-second, and the lack of overlap between these ranges. Even leap seconds (if we would want to consider those) are defined to always happen at the end of a month.

What I don't know is whether existing query engines and SQL operations are defined with sufficient sanity to be encompassed by this logic. I'm going to hazard a guess based on recent experiences with null and say no, so maybe we need to revise these types again. But in any case, the current definition of Substrait interval types is encompassed by a number with some implicit time unit associated with it (i.e. seconds or months), which makes them trivially ordered.

@richtia richtia force-pushed the add_comparison_functions branch from edb382b to 5a66d77 Compare August 12, 2022 21:52
@richtia richtia requested a review from cpcloud August 12, 2022 21:54
@richtia richtia force-pushed the add_comparison_functions branch 3 times, most recently from afd3bcd to 3fb45ab Compare September 2, 2022 20:09
@CLAassistant
Copy link

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@richtia richtia requested review from cpcloud and removed request for jvanstraten November 8, 2022 20:41
Uppercase letters are less than lowercase letters.
impls:
- args:
- value: List<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the value should be a List. I think it should just be a T. Variadic means that we can execute something like:

least(4,6,8,10).

I don't understand what something like the below would mean. This is what you are currently indicating with the arg type of List<T>, min:2.

least([2,4],[5,8])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! Thanks!

@richtia
Copy link
Member Author

richtia commented Apr 14, 2023

Would it make sense to provide an option to decide when null should be returned? (all arguments are null, or any arguments are null)

@westonpace
Copy link
Member

Given there are inconsistent vendors already then yes, I think we at least need an option.

However, you could argue the return types are different between the two variations.

"Return null if any of the arguments is null" would have a return of T|? while "return null only if all of the arguments are null" would have a return type of T&?.

Given that, instead of an option, it might almost make sense to have two different functions (least and least_skip_null?)

Though, I assume it is safe to err on the side of returning something that is nullable so it would probably be ok (if a little imprecise) to handle this with an option and use T|? as the return type.

I think we need @jacques-n or @cpcloud to weigh in on #340 first.

description: >-
Returns the smallest value. Only return null if 'all' arguments evaluate to null.

String comparison is done in lexicographical ordering, one character at a time, from left to right.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By lexicographic is the assumption the "C" locale? The ordering could be different if an alternative locale is chosen.

@richtia richtia force-pushed the add_comparison_functions branch from c58b8e4 to b44bdcc Compare October 26, 2023 00:07
@richtia richtia requested a review from vbarua as a code owner October 26, 2023 00:07
@richtia richtia force-pushed the add_comparison_functions branch from 89b5cbb to ab7bcbe Compare October 26, 2023 01:05
EpsilonPrime
EpsilonPrime previously approved these changes Oct 26, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In yesterday's community meeting we discussed this PR. I think we came to the following conclusions (but I am paraphrasing here so @EpsilonPrime can feel free to correct me):

  • Being able to specify a custom comparator for sorting is a problem that affects several functions. We should not hold up this PR while we figure that out. In the meantime, we should assume that all types have a default comparison method and we should not explicitly mention how values are compared.
  • We should support both "skip null" and "don't skip null" variants as two different functions instead of one function with two options.

With that said, I think these descriptions need updated. I also think we need a greatest_skip_null.

extensions/functions_comparison.yaml Outdated Show resolved Hide resolved
extensions/functions_comparison.yaml Outdated Show resolved Hide resolved
String comparison is done in lexicographical ordering, one character at a time, from left to right.
Uppercase letters are less than lowercase letters.

There is no greatest_skip_null function because it behaves the same as greatest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this conclusion. There is no existing engine I am aware of that behaves this way. For example, in both Oracle and MySQL GREATEST(1, 3, NULL) yields NULL. The theory is that "if any one of the inputs is unknown then I cannot know which is the greatest value because it may be the unknown one"

I think we do need a greatest_skip_null variant and this variant should not skip nulls.

Co-authored-by: Weston Pace <weston.pace@gmail.com>
@richtia
Copy link
Member Author

richtia commented Oct 26, 2023

@westonpace Thanks for the suggestion! I included them and added the greatest_skip_null function

@EpsilonPrime EpsilonPrime merged commit b3071bc into substrait-io:main Oct 31, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants