-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add methods propagating missing for string functions #26631
Conversation
Also change rounding functions to use a keyword argument for `base`, just like other methods.
Huh? Again, my understanding is that it would be sufficient to provide such definitions for a relatively small fixed set of "operator-like" functions, and use a |
Also, in the first link it seems |
Yes, this one doesn't sound very useful. I added it for consistency with I think there's a logic to adding other methods, it's that they treats strings as scalars and are therefore very frequently used on dataset columns which can include missing values. That's not the case for most functions, so I do think there's a relatively natural limit to adding special methods like these (just like we only support scalar math functions currently). |
@@ -49,9 +49,6 @@ endswith(str::AbstractString, chars::Chars) = !isempty(str) && last(str) in char | |||
startswith(a::String, b::String) = sizeof(a) ≥ sizeof(b) && | |||
ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, UInt), a, b, sizeof(b)) == 0 | |||
|
|||
startswith(a::Vector{UInt8}, b::Vector{UInt8}) = length(a) ≥ length(b) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this need a deprecation? Also unrelated to missing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe. Though since it's not documented and it only works for Vector{UInt8}
, is it really needed? I can add one, or move this to another PR, as you prefer.
That may be, but the larger point is that we don't want to spend time debating which functions are used often enough on dataset columns. How often do people call I also clearly remember that we had a deal that a small, fixed number of Base functions would directly support missing data, so that issues like these would not have to be dealt with over and over again. |
Really, that's not the case. Looking at base/exports.jl, apart from scalar math operators and functions, if you pick functions at random you'll notice that for most of them a special method for |
Sure, there are clearly many functions that would never need |
Right, case in point: DataValue did not pick the same set of string functions, because these are now fairly arbitrary choices. |
The rule for DataValues.jl is pretty simple: I just add as many functions as I can think of. No debate needed, if someone wants to add another one, I'll add it. The reason why there are functions that are lifted in this PR but not in DataValues.jl is simply that I haven't done a systematic review so far. I'll most likely just take this PR here as a blueprint to add lifted versions of all the functions that are in this PR here to DataValues.jl as well. So the strategy for DataValues.jl is to try to cover as many functions as possible out of the box, and then have |
This seems really a key point. @nalimilan is that your understanding as well? I was not privy to these discussions, but I always had thought that the plan for I do have a lot of sympathy for @JeffBezanson's view that this is troubling in base, it is one of (multiple) reasons why I always thought the missing data story should never have been put into base in the first place. But having it in base and not going for comprehensive white listing really seems the worst of all options... |
This is actually a somewhat better approach --- wait until somebody needs it. The functions here are just a scattershot approach. There is a data point in the linked discourse thread above, but these definitions would not have actually helped in that case. I just don't think there should be an expectation that "every" function supports missing. There should be a small C#-like list of functions that handle missing, and otherwise it's manual. That way it's fairly easy to know whether you can write Adding methods to handle missingness for every function and every kind of missingness is not scalable. We need to push a more general approach as much as possible. (As a bonus, it's more efficient since there are fewer definitions weighing down the compiler.) I'm not sure Base really has a missing data story. We just have a |
Whatever the decision is made adding
which in particular means lifting would performed only on positional arguments. I hope there are no performance issues with those definitions. I guess it should be possible to inline them by a compiler - right?. If not the second one can be rewritten as:
I would recommend adding it (either to this PR or I can make a separate PR). |
|
Actually when discussing it on Discourse I avoided using |
Let's discuss |
This isn't going to happen so let's close it. For reference the |
The first commit fixes a few details in docs and removes
startswith(::Vector{UInt8}, ::Vector{UInt8})
, which is the onlystartswith
method accepting vectors and is not documented nor used anywhere.The second commit adds methods propagating
missing
to string functions. This has been mentioned before here and here. I have only implemented methods where propagating missing sounds the most natural and useful: basically, for functions which treat strings as scalars and return a string or a scalar. This is consistent with implementing basic math functions and operators, and with the fact that strings are considered as scalars in some situations (e.g.broadcast
).I have omitted (at least) the following functions, because either they are only useful when operating directly on character or indices (which will fail for
missing
anyway) or because they return arrays:isascii
,ascii
andisvalid
;length
,codeunits
,ncodeunits
,nextind
,prevind
,thisind
,first
andlast
;string
andrepr
;reverse
;split
andrsplit
. Some cases are unclear, but it's easy to add more later if that turns out to be useful.