Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods propagating missing for string functions #26631

Closed
wants to merge 2 commits into from

Conversation

nalimilan
Copy link
Member

The first commit fixes a few details in docs and removes startswith(::Vector{UInt8}, ::Vector{UInt8}), which is the only startswith method accepting vectors and is not documented nor used anywhere.

The second commit adds methods propagating missing to string functions. This has been mentioned before here and here. I have only implemented methods where propagating missing sounds the most natural and useful: basically, for functions which treat strings as scalars and return a string or a scalar. This is consistent with implementing basic math functions and operators, and with the fact that strings are considered as scalars in some situations (e.g. broadcast).

I have omitted (at least) the following functions, because either they are only useful when operating directly on character or indices (which will fail for missing anyway) or because they return arrays: isascii, ascii and isvalid; length, codeunits, ncodeunits, nextind, prevind, thisind, first and last; string and repr; reverse; split and rsplit. Some cases are unclear, but it's easy to add more later if that turns out to be useful.

Also change rounding functions to use a keyword argument for `base`, just like other methods.
@nalimilan nalimilan added strings "Strings!" missing data Base.missing and related functionality labels Mar 27, 2018
@JeffBezanson
Copy link
Member

Huh? Again, my understanding is that it would be sufficient to provide such definitions for a relatively small fixed set of "operator-like" functions, and use a lift function for everything else. Otherwise there is no limit or logic to these definitions. In this particular case, it's a bit hard to imagine it being important e.g. to be able to repeat a string an unknown number of times.

@JeffBezanson
Copy link
Member

Also, in the first link it seems skipmissing is the needed feature, since otherwise the result of the whole operation is just missing. That is another factor limiting the utility of lots of Missing methods.

@nalimilan
Copy link
Member Author

In this particular case, it's a bit hard to imagine it being important e.g. to be able to repeat a string an unknown number of times.

Yes, this one doesn't sound very useful. I added it for consistency with x^y, which already propagates missing values for numbers. I can remove it.

I think there's a logic to adding other methods, it's that they treats strings as scalars and are therefore very frequently used on dataset columns which can include missing values. That's not the case for most functions, so I do think there's a relatively natural limit to adding special methods like these (just like we only support scalar math functions currently).

@@ -49,9 +49,6 @@ endswith(str::AbstractString, chars::Chars) = !isempty(str) && last(str) in char
startswith(a::String, b::String) = sizeof(a) ≥ sizeof(b) &&
ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, UInt), a, b, sizeof(b)) == 0

startswith(a::Vector{UInt8}, b::Vector{UInt8}) = length(a) ≥ length(b) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this need a deprecation? Also unrelated to missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe. Though since it's not documented and it only works for Vector{UInt8}, is it really needed? I can add one, or move this to another PR, as you prefer.

@JeffBezanson
Copy link
Member

and are therefore very frequently used on dataset columns

That may be, but the larger point is that we don't want to spend time debating which functions are used often enough on dataset columns. How often do people call escape_string on a column? I'd rather not even get into that question; almost any function might be called on a dataset column.

I also clearly remember that we had a deal that a small, fixed number of Base functions would directly support missing data, so that issues like these would not have to be dealt with over and over again.

@nalimilan
Copy link
Member Author

almost any function might be called on a dataset column.

Really, that's not the case. Looking at base/exports.jl, apart from scalar math operators and functions, if you pick functions at random you'll notice that for most of them a special method for missing wouldn't make sense. To cite a few functions taken in their order of appearance: falses, popfirst!, filter, pairs, dump, notify, time, convert, methods, isopen, realpath, run.

@JeffBezanson
Copy link
Member

Sure, there are clearly many functions that would never need missing support, but this PR still seems like too much to me. I am not convinced that these functions are all needed so often on columns with missing data that writing lift(chop) instead of chop is unacceptable. And the more functions that have missing support, the more random and unpredictable it gets, and the harder it is to synchronize with other packages like DataValues.

@nalimilan
Copy link
Member Author

@JeffBezanson
Copy link
Member

Right, case in point: DataValue did not pick the same set of string functions, because these are now fairly arbitrary choices.

@davidanthoff
Copy link
Contributor

The rule for DataValues.jl is pretty simple: I just add as many functions as I can think of. No debate needed, if someone wants to add another one, I'll add it. The reason why there are functions that are lifted in this PR but not in DataValues.jl is simply that I haven't done a systematic review so far. I'll most likely just take this PR here as a blueprint to add lifted versions of all the functions that are in this PR here to DataValues.jl as well.

So the strategy for DataValues.jl is to try to cover as many functions as possible out of the box, and then have . lifting or manual extraction as the fallback if a user comes across a function that is not lifted out of the box. Overall that has proven pretty usable over the last years.

@davidanthoff
Copy link
Contributor

I also clearly remember that we had a deal that a small, fixed number of Base functions would directly support missing data, so that issues like these would not have to be dealt with over and over again.

This seems really a key point. @nalimilan is that your understanding as well? I was not privy to these discussions, but I always had thought that the plan for Missing was to provide an as comprehensive set of lifted function as possible. If that was actually never the plan, aren't we going in circles and will end up with similar usability problems that Nullable had?

I do have a lot of sympathy for @JeffBezanson's view that this is troubling in base, it is one of (multiple) reasons why I always thought the missing data story should never have been put into base in the first place. But having it in base and not going for comprehensive white listing really seems the worst of all options...

@JeffBezanson
Copy link
Member

if someone wants to add another one, I'll add it

This is actually a somewhat better approach --- wait until somebody needs it. The functions here are just a scattershot approach. There is a data point in the linked discourse thread above, but these definitions would not have actually helped in that case.

I just don't think there should be an expectation that "every" function supports missing. There should be a small C#-like list of functions that handle missing, and otherwise it's manual. That way it's fairly easy to know whether you can write f or need to write lift(f). I don't want to be adding missing support to N new functions per week or month or whatever and have a moving target.

Adding methods to handle missingness for every function and every kind of missingness is not scalable. We need to push a more general approach as much as possible. (As a bonus, it's more efficient since there are fewer definitions weighing down the compiler.)

I'm not sure Base really has a missing data story. We just have a Missing type that implements 3VL. Is that really so objectionable? It has various potential uses, e.g. semi-predicates. If Base wants to support semi-predicates, is there a reasonable alternative to returning true, false, or missing?

@bkamins
Copy link
Member

bkamins commented Mar 29, 2018

Whatever the decision is made adding lift to base seems needed (I guess ). This is what I think functionally would be good:

lift(f::Function) = (x...; kw...) -> any(ismissing, x) ? missing : f(x...; kw...)
lift(f::Function, x...; kw...) = lift(f)(x...; kw...)

which in particular means lifting would performed only on positional arguments.

I hope there are no performance issues with those definitions. I guess it should be possible to inline them by a compiler - right?. If not the second one can be rewritten as:

lift(f::Function, x...; kw...) = any(ismissing, x) ? missing : f(x...; kw...)

I would recommend adding it (either to this PR or I can make a separate PR).

@yurivish
Copy link
Contributor

lift is a pretty general name and concept — are we sure we want to use that word for the specific meaning of lifting something to the domain of potentially-missing inputs?

@bkamins
Copy link
Member

bkamins commented Mar 29, 2018

Actually when discussing it on Discourse I avoided using lift as a name for this reason but it seems that this name naturally popped up several times independently. My opinion is that it should be preferably a short name, as it pops-up often when working with real data (at least in my experience).

@nalimilan
Copy link
Member Author

Let's discuss lift in a different issue.

@nalimilan
Copy link
Member Author

This isn't going to happen so let's close it. For reference the lift function described above is implemented under the name passmissing in Missings.jl.

@nalimilan nalimilan closed this Mar 2, 2022
@nalimilan nalimilan deleted the nl/strmissing branch March 2, 2022 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants