Consider allowing default in quantile and median #132

bkamins · 2022-11-28T07:58:16Z

In generic code the fact that the following operations error:

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Int[], 0.5)
ERROR: ArgumentError: empty data vector

is inconvenient (as the only work-around is try-catch). Maybe we could allow passing default kwarg that would be returned when the passed collection is empty?

The text was updated successfully, but these errors were encountered:

aplavin · 2022-11-29T09:44:50Z

Would make sense for mean as well. Currently, empty mean returns NaN for numbers and throws for non-numbers.

Btw, sum lets you specify the default instead of zero, but the argument is called init there.

bkamins · 2022-11-29T09:51:13Z

Currently, empty mean returns NaN for numbers and throws for non-numbers.

Yes, but at least it does not throw an error in standard case (but indeed we could allow for providing the other value of default as NaN, while correct is highly non-standard for Julia data processing)

called init there

The reason is that sum is a reduction, while e.g. median is not a reduction, so that is why init is not a best name in this case. But indeed - I was not sure what would be the best name of the kwarg.

aplavin · 2022-11-29T10:03:52Z

Yes, but at least it does not throw an error in standard case

Would be weird to have default in median, but not in mean. And I also wanted this functionality a couple of times!

NaN, while correct is highly non-standard for Julia data processing

Dunno, I like NaNs the majority of time, it's the only way to signal "there's no number here" in a type-stable way. I mostly wanted to use default for values like 0 or something similar, when I need an actual number.

that is why init is not a best name in this case

Yeah, init doesn't seem to fit well for mean/median, I agree. Just pointed out what exists elsewhere.

pdeffebach · 2022-11-29T14:15:27Z

FWIW, R returns NULL which is roughly equivalent to nothing.

aplavin · 2022-11-29T15:14:37Z

For mean, NaN seems the only right default choice, considering that mean(x) = sum(x) / length(x) and 0.0 / 0 is NaN.

However, if these functions all returned nothing for empty inputs, setting a default would be a straightforward something(median(x), 0.0). It's already common for Julia functions to return nothing indicating "no result" in some sense, eg findfirst-like functions.

nothing is also great in that functions deliberately don't propagate it further. So when a user gets nothing and forgets to handle it, he gets an exception very soon.

bkamins · 2022-11-29T15:44:59Z

These are all good ideas. But then the question is how to convert it to API in a non-breaking way?
We probably need a kwarg that would allow to switch behavior.

aplavin · 2022-11-29T16:33:51Z

Maybe, median/quantile can simply return nothing instead of throwing? Seems nonbreaking, and allows for easy default selection through something(...).

aplavin · 2022-11-29T16:37:34Z

Another idea is a general wrapper function so that the interface looks like onempty(median, xs, default=0.0).

bkamins · 2022-11-29T16:57:55Z

Seems nonbreaking, and allows for easy default selection through something(...).

It would be non-breaking, but I feared proposing it for such a common function 😄.

a general wrapper function so that the interface looks like onempty(median, xs, default=0.0).

This general wrapper function would require special casing for each function passed as its first argument to be efficient.

The point is that isempty(skipmissing(x)) is O(n) operation in general:

julia> x = missings(Int, 10^8);

julia> @time isempty(skipmissing(x))
  0.077613 seconds (1 allocation: 16 bytes)
true

But maybe it is good enough?

aplavin · 2022-11-29T17:15:59Z

The point is that isempty(skipmissing(x)) is O(n) operation in general

Good point! I mostly considered applying functions to an actual materialized array, not to skipmissing.

An advantage of onempty vs either median(; default) or something(median(), default) is that it makes clear when the default value is used - exactly when the input is empty. The other two variants read more ambiguous regarding when the default gets returned.

OTOH, advantages of something(median(), ...) is that it doesn't involve creating any extra functions, and is trivially efficient wrt isempty(...) checks.

bkamins · 2022-11-29T17:59:53Z

@nalimilan - what do you think?

nalimilan · 2022-11-29T20:56:14Z

I don't think there's a precedent of a function that returns nothing when the input is empty, right? So adding a default keyword argument seems simpler. It's too bad we can't call it init for consistency with sum.

aplavin · 2022-11-30T14:17:51Z

Some Julia functions throw an exception when no result is available (eg first), some do return nothing (eg findfirst).
Btw, first-or-default and only-or-default functions would also useful to have...

bkamins · 2022-11-30T14:41:28Z

In general where possible I personally prefer special value-based behavior rather than error-based. For median and quantile returning nothing is safe.

However, I understand the concern of @nalimilan that it requires a careful design so that we have one common solution rather than many special cases. E.g. nothing as a special value works for mean, median, quantile, and e.g. std, but it is not a good value for sum as:

julia> isnothing(sum([nothing]))
true

aplavin · 2022-11-30T14:53:19Z

It's unfortunate, but is a fact, that functions in Julia tend not to really consider that nothing can be contained within a collection. For example, findfirst returns nothing when it didn't find anything, or when it found the result, and its index is nothing:

# can distinguish these two:
julia> findfirst(x -> true, Dict(1 => 2))
1
julia> findfirst(x -> false, Dict(1 => 2))

# but not these two:
julia> findfirst(x -> true, Dict(nothing => 2))

julia> findfirst(x -> false, Dict(nothing => 2))

Potentially, median could reasonably be extended to non-numbers in the future, and return the only element for length-1 collections. But I personally don't see it as a major issue...

mreichMPI-BGC · 2022-12-06T20:10:21Z

Following this https://discourse.julialang.org/t/summarizing-grouped-dataframe-where-a-group-is-entirely-missing/90920/20?u=mreichmpi-bgc my 5 cents from a user perspective (and with a Tidyverse background as the most recent):

I like the general wrapper on_empty . And I think the model of purrr in Tidyverse of having wrappers for capturing side effects is quite convenient in data analysis, e.g. safely, possibly and quietly etc. Cf. Capture side effects. — safely • purrr.

In general I believe the user base will expand if such function are added, even at the expense of performance or other design aspects.

bkamins · 2022-12-17T12:21:40Z

I was thinking what is best and came to the conclusion that we could define somewhere:

withempty(f, default) = x -> isempty(x) ? default : f(x)

(name is tentative)

With skipmissing the performance will not be optimal, but I think most of the time users will not complain (you would need to have a vector of all missing values with millions of such elements to notice the problem.

In this post I have shown an example usage.

aplavin · 2022-12-17T12:52:30Z

I also like this solution, but wonder if it's possible to generalize in the following sense, while keeping simple and intuitive.

Often, one wants to compute mean/median/std/whatever only when at least several values are available. If you are interested in the distribution location, mean of a single sample isn't of much use - even more so for scale measurements such as std.
From time to time, I find myself doing things like length(x) >= 4 ? mean(x) : NaN. Conceptually, this seems very similar to the ifempty check discussed here.

Maybe, there's some clean solution to both?.. Anything to borrow from R/others?

bkamins · 2022-12-17T14:51:24Z

We could define:

withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)

(I proposed adding defaults, but we could skip them).

Then, by default we would have "return missing if empty, but one could write withcondition(median, default=NaN, condition=x->length(x) < 5) to get what you want.

nalimilan · 2022-12-17T14:53:51Z

That sounds overkilll to me. The point of such a function is to be simple and convenient, as users can easily define other functions if needed. Actually we could almost have emptymissing(f) = x -> isempty(x) ? missing : f(x) to make things even simpler for the common case (and add that to Missings.jl).

bkamins · 2022-12-17T14:55:26Z

I would be also OK with such a solution as it likely covers most of the use cases.

jariji · 2022-12-17T19:12:04Z

After reading @bkamins blog post I came here to suggest

whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)

It works for isempty, isnothing, ismissing, isnan. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.

nalimilan · 2022-12-17T20:29:19Z

whenever(predicate, f, default) is only 3 characters shorter than x -> predicate(x) ? default : f(x) so I don't really see the point of adding such a general function.

bkamins · 2022-12-17T22:29:52Z

I don't really see the point of adding such a general function.

There are two points that can be considered:

you actually need to write (x -> predicate(x) ? default : f(x)) in practice, and these extra ( and ) create visual noise and are easy to forget.
x -> predicate(x) ? default : f(x) will be re-compiled every time it is written, while whenever(predicate, f, default) is compiled once

jariji · 2022-12-17T22:51:59Z

BTW, I'm not totally sold on the argument order being f, default. whenever(predicate, value, function) seems closer to the familiar if-then-else ordering. This ambiguity is the downside of positionals.

Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing/nonempty/nonnan that could be nicer than the current system.

mreichMPI-BGC · 2022-12-18T09:18:07Z

We could define:
withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)
(I proposed adding defaults, but we could skip them).

I really like that most. I like, when there are reasonable, intuitive defaults that cover most of the cases, but then allow the user to decide to make things as complicated as she/he wants.

If would probably like to see these three functions (at least):

catchCondition(f; condition=isempty,  default=missing) = x -> condition(x) ? default : f(x)
catchAny(f; condition=isempty,  default=missing) = x -> condition(x) ? default : try f(x) catch e  default end
catchMissing(f; catchFun=catchCondition) = x -> catchFun(f)(skipmissing(x))

The last one would just allow avoiding typing skipmissing everytime. Just for convenience, but I think important for easy usage.

I would like to be able to pass the kwargs from catchMissing also to the other functions, but somehow catchMissing(f; catchFun=catchCondition, kwargs...) did not work (sorry too new to Julia...).

Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing/nonempty/nonnan that could be nicer than the current system.

Agree - that's why I'd suggest having "catch" in the name, which makes it more intuitive logically IMO.

whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)
It works for isempty, isnothing, ismissing, isnan. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.

I really like the keyword version with defaults much more. Also I think for the user it is very important to have a consistent design, e.g. here always have f at the beginning. This is what makes tidyverse so attractive I believe.

aplavin · 2022-12-18T17:31:10Z

Another idea for a general and convenient interface, that encompasses both isempty and length checks:

iflength(median, 0 => NaN)(x)  # NaN if x is empty
iflength(median, 0:3 => NaN)(x)  # NaN if length(x) is <= 3

Also, no matter what predicates are incorporated, I believe

default=missing

shouldn't be included by default. If anything, the default value should be nothing as common in Julia elsewhere: at least, nothing immediately throws on almost any operation instead of silently propagating, and cannot be forgotten.
Or (even better, I think) the default argument shouldn't have a default value ( :) ) at all, so that users always choose explicitly. Two common propagating not-a-value placeholders are NaN and missing.

bkamins · 2022-12-19T07:44:09Z

Apart from the general solution discussed here, in JuliaData/Missings.jl#143 I propose adding emptymissing as @nalimilan since it covers the most common problematic case.

nalimilan · 2022-12-24T17:03:12Z

Thanks @bkamins. Regarding the discussion about more general suggestions, note that this issue is currently in Statistics.jl, but it wouldn't be super logical to add such a general function in this package IMO. It's important to think about where it would go if that function was to be added.

bkamins · 2022-12-24T22:16:36Z

It's important to think about where it would go if that function was to be added.

I agree. That is why I thought adding emptymissing in Missings.jl was a clean choice.

I thought, that once a consensus is reached about the API, a decision can be made where such a function should live (maybe even Base Julia if it is general enough)

bkamins mentioned this issue Dec 19, 2022

add emptymissing JuliaData/Missings.jl#143

Merged

non-Jedi mentioned this issue Aug 28, 2024

Add init kwarg to mean/median to match other reductions in Base #105

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider allowing default in quantile and median #132

Consider allowing default in quantile and median #132

bkamins commented Nov 28, 2022

aplavin commented Nov 29, 2022

bkamins commented Nov 29, 2022

aplavin commented Nov 29, 2022

pdeffebach commented Nov 29, 2022

aplavin commented Nov 29, 2022 •

edited

Loading

bkamins commented Nov 29, 2022

aplavin commented Nov 29, 2022

aplavin commented Nov 29, 2022

bkamins commented Nov 29, 2022 •

edited

Loading

aplavin commented Nov 29, 2022 •

edited

Loading

bkamins commented Nov 29, 2022

nalimilan commented Nov 29, 2022

aplavin commented Nov 30, 2022

bkamins commented Nov 30, 2022

aplavin commented Nov 30, 2022

mreichMPI-BGC commented Dec 6, 2022

bkamins commented Dec 17, 2022

aplavin commented Dec 17, 2022 •

edited

Loading

bkamins commented Dec 17, 2022

nalimilan commented Dec 17, 2022

bkamins commented Dec 17, 2022

jariji commented Dec 17, 2022

nalimilan commented Dec 17, 2022

bkamins commented Dec 17, 2022

jariji commented Dec 17, 2022 •

edited

Loading

mreichMPI-BGC commented Dec 18, 2022

aplavin commented Dec 18, 2022

bkamins commented Dec 19, 2022

nalimilan commented Dec 24, 2022

bkamins commented Dec 24, 2022

Consider allowing default in quantile and median #132

Consider allowing default in quantile and median #132

Comments

bkamins commented Nov 28, 2022

aplavin commented Nov 29, 2022

bkamins commented Nov 29, 2022

aplavin commented Nov 29, 2022

pdeffebach commented Nov 29, 2022

aplavin commented Nov 29, 2022 • edited Loading

bkamins commented Nov 29, 2022

aplavin commented Nov 29, 2022

aplavin commented Nov 29, 2022

bkamins commented Nov 29, 2022 • edited Loading

aplavin commented Nov 29, 2022 • edited Loading

bkamins commented Nov 29, 2022

nalimilan commented Nov 29, 2022

aplavin commented Nov 30, 2022

bkamins commented Nov 30, 2022

aplavin commented Nov 30, 2022

mreichMPI-BGC commented Dec 6, 2022

bkamins commented Dec 17, 2022

aplavin commented Dec 17, 2022 • edited Loading

bkamins commented Dec 17, 2022

nalimilan commented Dec 17, 2022

bkamins commented Dec 17, 2022

jariji commented Dec 17, 2022

nalimilan commented Dec 17, 2022

bkamins commented Dec 17, 2022

jariji commented Dec 17, 2022 • edited Loading

mreichMPI-BGC commented Dec 18, 2022

aplavin commented Dec 18, 2022

bkamins commented Dec 19, 2022

nalimilan commented Dec 24, 2022

bkamins commented Dec 24, 2022

aplavin commented Nov 29, 2022 •

edited

Loading

bkamins commented Nov 29, 2022 •

edited

Loading

aplavin commented Nov 29, 2022 •

edited

Loading

aplavin commented Dec 17, 2022 •

edited

Loading

jariji commented Dec 17, 2022 •

edited

Loading