-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add isordered #26
base: main
Are you sure you want to change the base?
Add isordered #26
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; @nalimilan ?
test/runtests.jl
Outdated
|
||
@testset "isordered" begin | ||
|
||
@test isordered([1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are failing because isordered
isn't exported from DataAPI; I'd rather not export to be consistent with other DataAPI definitions, but I also think it's probably a fairly safe case if we really want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I will fix it - every time I think that the PR is simple (and I just edit it in the web browser) I mess up things. I should be more careful.
src/DataAPI.jl
Outdated
@@ -89,6 +91,9 @@ If the collection is not sortable then the order of levels is unspecified. | |||
|
|||
Contrary to [`unique`](@ref), this function may return values which do not | |||
actually occur in the data, and does not preserve their order of appearance in `x`. | |||
|
|||
If a type of `A` implements `levels` with a meaningful order then it should also | |||
implement `isordered` as by defualt `isordered` returns `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implement `isordered` as by defualt `isordered` returns `false`. | |
implement `isordered` as by default `isordered` returns `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence raises many issues about what we really want isordered
to mean. Types that implement levels are AbstractArray
s, and in general they don't know with what eltypes they will be used. So the ordering doesn't depend on them.
CategoricalArray
is a very special case, but even there I would say that isordered
works because the element type is CategoricalValue
, which supports <
. So everything is defined by the element type.
Likewise, below, I wouldn't say that <
is consistent with levels
. Since <
is defined on elements, not on arrays, and levels
is defined on arrays, there's only one way to make them consistent: have levels
return values sorted according to <
/isless
, which is what the levels
fallback does. So actually it's <
and isless
that have to be consistent (this is probably always the case).
What could happen is that a special array type could define levels
to return something which isn't consistent with <
/isless
. But that doesn't mean that entries in the arrays wouldn't be comparable using <
. Do we want to support this? That sounds a bit weird, but it could happen e.g. with a special type similar to CategoricalArray
but which wouldn't wrap values in CategoricalValue
(IOW a type similar to PooledArray
but for which levels
would return the pool). Should this kind of object define isordered
to return false
? That would be weird since entries could still be compared using <
-- but the result would be different from the order in levels
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. I was thinking how to put it best and I just "kind of" reused what CategoricalArrays.jl defined:
Test whether entries in
A
can be compared using<
,>
and similar operators, using the ordering of levels.
Which implies that isordered
should tell us if what levels(A)
returns is ordered if we use <
for comparison of these elements (so technically if issorted(levels(A))
returns true
).
In particular I understood that if levels
returns something for a type that supports <
that fails issorted(levels(A))
then isordered
should return false
.
This was my understanding of the original intent of what isordered
should mean i.e. that two conditions must be met:
- eltype of
A
supports<
etc. levels
returns values in order consistent with it
(but maybe I was wrong).
But if I was right then levels
and isordered
must be in sync. This is different from Base.Ordered()
, as Base.Ordered()
is unrelated with levels
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the problem is that isordered
was initially designed with only CategoricalArrays in mind, and for them it happens to be the case that isordered
, levels
and <
/isless
are all consistent. So now to extend it we have to decide whether that should always be the case (and document it) or not. That would be the simplest and most obvious definition. But it's also quite strict as e.g. a custom array for which levels(x) == ["c", "b", "a"]
wouldn't respect it. In short, I think we should know what we want to use this trait for to choose the most useful solution.
@quinnj You mentioned at #23 (comment) that Arrow.jl could use this for a custom dict-encoded type. How would that type behave? How would it implement isordered
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arrow spec says:
/// By default, dictionaries are not ordered, or the order does not have
/// semantic meaning. In some statistical, applications, dictionary-encoding
/// is used to represent ordered categorical data, and we provide a way to
/// preserve that metadata here
isOrdered: bool;
So it's not really doing much by itself other than preserving the metadata. The DictEncoded
array type in Arrow.jl is basically just a PooledArray
, but also includes the boolean isOrdered
field; but any further "semantics" would need to be handled by the user or custom comparison code (or in the example of converting to a CategoricalArray, it could query the DataAPI.isordered
property).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be OK with both. The original definition came from CategoricalArrays.jl which are special, so we can be more flexible here I think (in which case isordered
is giving us roughly the same as Ordered
trait in Julia Base).
EDIT: so then the question is if we need both (the benefit of having it in DataAPI.jl is that we can change things more easily than in Julia Base).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Arrow.DictEncoded
doesn't want to expose the custom order of levels using generic API, then I don't think it makes sense for it to implement isordered
. Otherwise that function will be useless as it won't allow writing generic code that can do something with levels: you would know there's an order, but you wouldn't be able to retrieve it.
One issue with Base.OrderStyle
is that it's supposed to work when passing only the type, but for CategoricalArray it's a runtime property...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Arrow.DictEncoded
does want to expose the custom order of levels? I'd like to implement DataAPI.refarray
, DataAPI.refpool
, DataAPI.levels
and DataAPI.isordered
on DictEncoded, so it can participate in all the optimizations built around those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I thought you didn't want that as you said:
Personally, I don't think we should require
eltype
of A supports<
etc. andlevels
returns values in order consistent with it, if we don't have to.
Then we can require isordered
to be consistent with levels
at least, but mention that it may be inconsistent with <
and isless
. But we shouldn't say that isordered
should return true
for types that implement levels
, as it wouldn't always be the case for CategoricalArray
and Arrow.DictEncoded
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yeah, that makes sense to me. Yeah, for Arrow.DictEncoded
, isordered
is an independent property.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
As I have commented to @nalimilan on slack. I think it is the time to rethink the API that DataAPI.jl for ref values should expose in general, as we are getting more container types supporting it. For example another function that CategoricalArrays.jl exposes is is |
The use case I was thinking of was in Econometrics.jl, I need to assert whether the outcome variable is ordinal or nominal for choosing the appropriate estimator. When people use |
src/DataAPI.jl
Outdated
using the ordering of `levels(A)`. | ||
|
||
If a type of `A` implements `levels` with a meaningful order then it should also | ||
implement `isordered` as by defualt `isordered` returns `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo, default
Yes but I think that's on purpose: we have split the old |
My perspective: PooledArrays.jl is lightweight and CategoricalArrays.jl is (unfortunately, but unavoidably) heavy because of this distinction. |
I'm not sure that's true anymore now that we got rid of most invalidations. |
By "heavy" I did not mean problematic for Julia compiler, but rather that it has much more complex logic (e.g. the code for merging levels between two |
I am not opposed to having different representations (e.g., efficient storage of few values vs categorical). What would be useful is to have any representation that does not support or uses an ordinal property to be queryable about it. Basically, provide an API that allows to get the information: does this representation offer this information / property and if it does how to query it. |
Maybe you should just assume that your users will be using scientific types from https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/? |
Likely not. Currently, the docs is to just use CategoricalArrays but every so often a user encounters an error due to using PooledArray. I am aiming to make the code more general and friendly rather than restricted / with more assumptions. However, having the API here might help clean up the MLJScientificTypes code. |
OK, then I think we should go with what I proposed above at #26 (comment). |
Test whether entries in `A` can be compared using `<`, `>` and similar operators, | ||
using the ordering of `levels(A)`. | ||
|
||
If a type of `A` implements `levels` with a meaningful order then it should also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - could you please comment in the implementation how your proposal should be incorporated? Thank you!
Notice that OrderedCollections.jl also defines |
Good catch. Luckily it's not exported (JuliaCollections/OrderedCollections.jl#45). |
Fixes #23