feat: add least and greatest functions to functions_comparison.yml #247
I feel like string could use some more details in the description. Case sensitivity? Lexicographic vs natural? etc. Float too maybe for NaN behavior.
Hmm...the entire topic of "comparison" probably deserves a nice hefty block of prose somewhere (perhaps on the site itself). Otherwise we are at risk of repeating ourselves all over the YAML. The same goes for overflow, overflow vs NaN, etc. I wonder if we want a section on the website for "functions" where some of this text can live.
In PostgreSQL, |
Do you or @westonpace have a suggestion for how to handle the NaN behavior?
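To make the NaN question concrete, here is a small sketch in Python. It only illustrates why a least/greatest description needs to say something about NaN; it does not describe what Substrait or any particular engine does. Python's built-in min stands in for a naive implementation.

```python
import math

nan = float("nan")

# Under IEEE 754 every ordered comparison that involves NaN is False.
print(nan < 1.0, nan > 1.0, nan == nan)  # False False False

# Python's min keeps the first argument unless a later one is strictly
# smaller, so a naive "least" built on "<" depends on argument order.
nan_first = min(nan, 1.0)
nan_last = min(1.0, nan)
print(math.isnan(nan_first), nan_last)  # True 1.0
```

A spec could pick either "NaN propagates" or "NaN is ignored"; the point is only that leaving it unstated gives order-dependent results.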
Isn't this kind of behavior up to the extension that specifies the function, though? I could imagine different engines having subtly different native ways of doing comparisons, in which case conforming with Substrait's "defaults" might cost performance. You'd then want to have the option to override Substrait's defaults by just using a different function.
The alternative is associating ordering information with types instead, or at least default ordering information. SortRel actually already requires this for the default ascending/descending sorts, but beyond how nulls are to be ordered it leaves ordering up to the imagination of the user. Also, for the sort-by-function method it leaves the function signature (return type?) and the behavior for return values outside of [-1, 0, 1] unspecified.
Digression: personally I don't like having these SQL-esque default "ascending/descending" sorts at all; the implication that all types should have exactly one default ordering method seems odd to me. There is no logical way to order a 2D coordinate, for instance: you could just order by X first and then by Y, but that's as meaningless as any other sort order (Y first, by polar coordinates, by Hilbert curve, whatever). Instead, I'd much prefer having only the "custom function identifier" and "clustered" methods. If it were up to me I'd deprecate/remove the default orderings and define something like this instead:
where by If I'm not the only one who feels this way I can escalate this to an issue or PR.
Personally I'd much rather documentation be repeated ad nauseam than only be specified in one place where someone might not find it. It requires more maintenance, but no one wants to or can be expected to scour the complete documentation for clues when they need one specific piece of information, especially when (at present) odds are that no one has thought about it yet at all, or at least no one has written it down anywhere. Linking to the single point of truth would also be fine (or better) but also requires maintenance to keep the links live.
This pushes the burden of defining how types are sorted out of the spec and into the producers. However, the communication between producer & consumer would be very clear at that point which I believe is the point of the spec. This seems very similar to the implicit cast discussion. However, as someone working primarily on a consumer it is easy for me to say "push it all to the producer" 😆
There should always be links/pointers to the comprehensive documentation. Yet I'd like to avoid copy/pasting entire paragraphs.
With regard to the sorting discussion specifically, there are two options in the spec: you choose the structured SQL-type sorts with asc/desc and nulls first/nulls last, OR you choose a specific function to reference. https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857
The intention I had was to formally declare a default comparison operation within the Substrait spec for known types, but also allow one to use any function one wants for sorting, using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this, but I feel like it allows for arbitrary alternative collations, etc. while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.
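As an illustration of the "function that returns -1, 0 or 1" idea, here is a minimal Python sketch. The comparator names are hypothetical and nothing here is a Substrait API; it only shows how two different three-way comparators give two equally valid orderings of the same 2D points, which is the concern raised earlier in the thread.

```python
from functools import cmp_to_key


def three_way(a, b):
    """Return -1, 0 or 1 depending on how a compares to b."""
    return (a > b) - (a < b)


def compare_x_then_y(p, q):
    # Hypothetical comparator: order by x, break ties on y.
    return three_way(p[0], q[0]) or three_way(p[1], q[1])


def compare_y_then_x(p, q):
    # Hypothetical comparator: order by y, break ties on x.
    return three_way(p[1], q[1]) or three_way(p[0], q[0])


points = [(2, 1), (1, 3), (1, 2)]
by_x = sorted(points, key=cmp_to_key(compare_x_then_y))
by_y = sorted(points, key=cmp_to_key(compare_y_then_x))
print(by_x)  # [(1, 2), (1, 3), (2, 1)]
print(by_y)  # [(2, 1), (1, 2), (1, 3)]
```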
Do you have a suggestion for how to handle the sorting in the YAML spec? For example, if the default were lexicographical, how would I specify a natural sorting option? Maybe for this PR we could also just try to get in what the function signatures look like, and we can document/follow up on the sorting expectations for types via a GitHub issue.
This is unclear to me as well. How would a custom compare function be provided to a scalar function like the ones specified here (or lt or gt)?
A follow-up issue seems reasonable to me given we already have functions like |
LGTM as is, considering the description states how strings are to be ordered, despite the discussion surrounding ordering. I could see the string versions being superseded by something more generic at some point, though.
dc2616b to e15c063
Needed to fix the commit messages to pass the linting check.
extensions/functions_comparison.yaml (Outdated)
      - value: "string"
    variadic:
      min: 1
    return: "string"
Should we add the rest of the types here?
I would add everything except:
- interval_year
- interval_day
- struct
- list
- map
Just added all the other ones, except for the ones you listed and boolean.
edit: I also added uuid, but not sure how much sense that one makes? Let me know if that one should be removed.
@cpcloud Why don't you consider intervals to be comparable? Substrait effectively defines them as a number of months and a number of microseconds respectively, so I don't see why they're special. I would personally argue that only UUIDs and maps make little sense to order, though it's perfectly possible to define an ordering for them (for maps because they are basically just defined to behave like list<struct<K, V>>, and both of those can be compared using tie).
IMO this should just be least(T) -> T / greatest(T) -> T. Likewise for all normal comparison functions. If an engine doesn't consider a type comparable*, they will already have had to solve this problem and return suitable errors for sort relations with sort keys that have no default comparison operation and no custom comparator function.
* and we should just define how each builtin type is to be compared in the absence of a custom comparator function somewhere in the spec, so this isn't for the engine to decide.
I don't consider intervals to be comparable because of time zone shenanigans.
Consider two intervals: 1 day and 24 hours. How do these compare?
Here's how I view this:
- ==: only possible to implement if you know the two timestamps (with time zones) that produced the interval (if that was even how it was produced!), because a timezone change across the boundary would potentially make these two unequal. An interval whose fields are exactly equivalent should compare equal.
- !=: Similar to ==, time zones make implementing this in any kind of "obvious" way approximately impossible
- < / >: ordering has similar problems to = and !=
Let's say we have a timestamp of 2022-01-01 12:00:00 and tomorrow we switch to DST (bear with me for the sake of example).
Looking at the result of adding the above two intervals to that timestamp:
- 2022-01-01 12:00:00 + 1 day gives 2022-01-02 13:00:00
- 2022-01-01 12:00:00 + 24 hours gives 2022-01-02 12:00:00
Duration-based intervals I think should be comparable, but comparing finer granularity than day with day or coarser seems too fraught.
I hope I'm just wrong here and there's a sane way to do this.
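The divergence described above can be reproduced with a small Python sketch. ToyZone is a hypothetical time zone invented for this example (UTC+1 until local midnight on 2022-01-02, UTC+2 afterwards), and both helper functions are illustrative names. Which reading an engine attaches to "1 day" versus "24 hours" is an engine choice; the sketch only shows that "same wall-clock time on the next calendar day" and "24 elapsed hours" land one hour apart across the switch.

```python
from datetime import datetime, timedelta, timezone, tzinfo


class ToyZone(tzinfo):
    """Hypothetical zone: UTC+1 until local midnight on 2022-01-02, UTC+2 after."""

    SWITCH = datetime(2022, 1, 2, 0, 0)  # naive local wall-clock time of the change

    def _is_dst(self, dt):
        return dt.replace(tzinfo=None) >= self.SWITCH

    def utcoffset(self, dt):
        return timedelta(hours=2 if self._is_dst(dt) else 1)

    def dst(self, dt):
        return timedelta(hours=1 if self._is_dst(dt) else 0)

    def tzname(self, dt):
        return "TOY"


def add_calendar_day(ts):
    """Same wall-clock time on the next calendar day."""
    return ts + timedelta(days=1)


def add_elapsed_24h(ts):
    """Exactly 24 hours of elapsed time, measured in UTC."""
    return (ts.astimezone(timezone.utc) + timedelta(hours=24)).astimezone(ts.tzinfo)


start = datetime(2022, 1, 1, 12, 0, tzinfo=ToyZone())
print(add_calendar_day(start).strftime("%Y-%m-%d %H:%M"))  # 2022-01-02 12:00
print(add_elapsed_24h(start).strftime("%Y-%m-%d %H:%M"))   # 2022-01-02 13:00
```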
This kind of slipped through the cracks because I thought we had resolved this offline (more or less), but I must admit I never considered this example.
At some point in the past, when I clarified what the built-in types mean, I made the assertion that it should be allowed to store interval types using a single number. In other words, 1 day, 24 hours, 1440 minutes, or 86400 seconds, should all mean exactly the same thing. I figured we could do this because year-month and day-second were separate types already anyway (but, again, didn't consider this one). I did this to make the description of the types at all compatible with Arrow's types, because you could fairly easily construct an interval type that stores the components separately using structs anyway, and frankly, because (even after this example) I remain convinced that it's good enough.
For this particular example, I would argue that the result depends on whether you're adding the interval to a timestamp or timestamp_tz. timestamp has no timezone awareness, so in both cases you would get 2022-01-02 12:00:00. timestamp_tz instead represents "real" time, where DST just doesn't exist. You simply get the timestamp that occurs 24 hours/1 day later. When represented in this particular timezone that might result in 2022-01-01 12:00:00 -> 2022-01-02 13:00:00, but represented in UTC it might be 2022-01-01 11:00:00Z -> 2022-01-02 11:00:00Z if that timezone happened to be CET. Timezone shenanigans in general are captured by the conversion between timestamp and timestamp_tz, and need not exist anywhere else.
This leaves only the more fundamental problem that months are not always the same length in our calendar, which is already covered by having different types for year-month and day-second, and the lack of overlap between these ranges. Even leap seconds (if we would want to consider those) are defined to always happen at the end of a month.
What I don't know is whether existing query engines and SQL operations are defined with sufficient sanity to be encompassed by this logic. I'm going to hazard a guess based on recent experiences with null and say no, so maybe we need to revise these types again. But in any case, the current definition of Substrait interval types is encompassed by a number with some implicit time unit associated with it (i.e. seconds or months), which makes them trivially ordered.
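A minimal Python sketch of the "single number with an implicit unit" reading described above. The helper names and field layout are assumptions made for illustration; they are not taken from the Substrait spec. Under this reading, 1 day, 24 hours, 1440 minutes and 86400 seconds collapse to the same integer, so ordering is plain integer ordering.

```python
MICROS_PER_SECOND = 1_000_000


def interval_day_micros(days=0, hours=0, minutes=0, seconds=0):
    """Hypothetical helper: collapse a day-to-second interval into microseconds."""
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds * MICROS_PER_SECOND


def interval_year_months(years=0, months=0):
    """Hypothetical helper: collapse a year-to-month interval into months."""
    return years * 12 + months


one_day = interval_day_micros(days=1)
print(one_day == interval_day_micros(hours=24))       # True
print(one_day == interval_day_micros(minutes=1440))   # True
print(one_day == interval_day_micros(seconds=86400))  # True

# With a single number per value, least/greatest reduce to min/max.
print(min(interval_day_micros(hours=30), one_day) == one_day)  # True
```

The two interval types stay separate here, so a month count is never compared against a microsecond count.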
edb382b to 5a66d77
afd3bcd to 3fb45ab
extensions/functions_comparison.yaml (Outdated)
      Uppercase letters are less than lowercase letters.
    impls:
      - args:
          - value: List<T>
I don't think the value should be a List. I think it should just be a T. Variadic means that we can execute something like:
least(4,6,8,10)
I don't understand what something like the below would mean. This is what you are currently indicating with the arg type of List<T>, min:2.
least([2,4],[5,8])
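A short Python sketch of the distinction being made here. The least function below is a stand-in written for this example, with Python's min playing the role of the default comparison; it is not the Substrait definition. With a variadic T, each argument is one candidate value. With a variadic List<T>, each argument is itself a list, so the call would compare whole lists against each other.

```python
def least(*values):
    """Stand-in for a variadic least(T...) -> T; needs at least one argument."""
    if not values:
        raise TypeError("least() requires at least one argument")
    return min(values)


# Variadic over T: every argument is a candidate value.
print(least(4, 6, 8, 10))  # 4

# Variadic over List<T>: every argument is a list, so whole lists are
# compared (Python compares lists element by element).
print(least([2, 4], [5, 8]))  # [2, 4]
```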
Updated! Thanks!
Would it make sense to provide an option to decide when null should be returned? ( |
Given there are inconsistent vendors already then yes, I think we at least need an option. However, you could argue the return types are different between the two variations. "Return null if any of the arguments is null" would have a return of Given that, instead of an option, it might almost make sense to have two different functions ( Though, I assume it is safe to err on the side of returning something that is nullable so it would probably be ok (if a little imprecise) to handle this with an option and use I think we need @jacques-n or @cpcloud to weigh in on #340 first. |
extensions/functions_comparison.yaml (Outdated)
description: >-
  Returns the smallest value. Only return null if 'all' arguments evaluate to null.

  String comparison is done in lexicographical ordering, one character at a time, from left to right.
By lexicographic, is the assumption the "C" locale? The ordering could be different if an alternative locale is chosen.
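To show why the choice of ordering matters, here is a small Python sketch. Python compares strings by Unicode code point, which for ASCII text gives the same result as the byte-wise ordering of the "C" locale; the case-insensitive key is only one example of an alternative ordering and is not something the PR proposes.

```python
words = ["apple", "Banana", "cherry", "apples"]

# Code-point order: every uppercase ASCII letter sorts before every
# lowercase one, and a prefix sorts before the longer string.
by_code_point = sorted(words)
print(by_code_point)  # ['Banana', 'apple', 'apples', 'cherry']

# A case-insensitive ordering gives a different answer for the same input.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)  # ['apple', 'apples', 'Banana', 'cherry']

print(min(words), min(words, key=str.casefold))  # Banana apple
```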
c58b8e4 to b44bdcc
89b5cbb to ab7bcbe
In yesterday's community meeting we discussed this PR. I think we came to the following conclusions (but I am paraphrasing here so @EpsilonPrime can feel free to correct me):
- Being able to specify a custom comparator for sorting is a problem that affects several functions. We should not hold up this PR while we figure that out. In the meantime, we should assume that all types have a default comparison method and we should not explicitly mention how values are compared.
- We should support both "skip null" and "don't skip null" variants as two different functions instead of one function with two options.
With that said, I think these descriptions need to be updated. I also think we need a greatest_skip_null.
extensions/functions_comparison.yaml (Outdated)
  String comparison is done in lexicographical ordering, one character at a time, from left to right.
  Uppercase letters are less than lowercase letters.

  There is no greatest_skip_null function because it behaves the same as greatest.
I disagree with this conclusion. There is no existing engine I am aware of that behaves this way. For example, in both Oracle and MySQL GREATEST(1, 3, NULL) yields NULL. The theory is that "if any one of the inputs is unknown then I cannot know which is the greatest value because it may be the unknown one".
I think we do need a greatest_skip_null variant and this variant should not skip nulls.
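The two variants discussed in this thread can be sketched in Python as follows, with None standing in for SQL NULL. This is one reading of the conversation, not the merged YAML definition: greatest returns null as soon as any argument is null, while greatest_skip_null ignores nulls and returns null only when every argument is null.

```python
def greatest(*values):
    """Null-propagating variant: any None argument makes the result None."""
    if any(v is None for v in values):
        return None
    return max(values)


def greatest_skip_null(*values):
    """Null-skipping variant: None only if every argument is None."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


print(greatest(1, 3, None))               # None
print(greatest_skip_null(1, 3, None))     # 3
print(greatest_skip_null(None, None))     # None
```

Keeping these as two functions rather than one function with an option means each one has a single, fixed nullability rule for its return type.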
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@westonpace Thanks for the suggestion! I included them and added the greatest_skip_null function.
PR to add functions for least and greatest.