replace does not respect target type #352

Open
nalimilan opened this issue May 14, 2021 · 3 comments

@nalimilan
Member

Currently replace calls recode and always returns a CategoricalArray, even if the target values are not CategoricalValues. This is OK for recode as it's specific to CategoricalArrays, but ideally replace should respect the target type. Unfortunately, the behavior of replace on arrays is to call promote_type on the source element type and on target values' types. This would give weird arrays such as Array{Union{CategoricalValue{String,UInt32}, Int}}. We could use a different approach which would choose the element type based on the actual values, like broadcast. But that would trade an inconsistency with Base for another inconsistency.
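The two candidate rules can be compared on a plain array. This is a minimal sketch using only Base; the variable names are illustrative and not taken from the package.

```julia
v = ["X", "X", "Y", "Z"]

# Base.replace promotes eltype(v) with the types of the new values:
# promote_type(String, Int) is Any, even though every value gets replaced.
r = replace(v, "X" => 1, "Y" => 2, "Z" => 3)
println(typeof(r))   # a Vector{Any}

# Broadcasting picks the element type from the values actually produced.
codes = Dict("X" => 1, "Y" => 2, "Z" => 3)
b = getindex.(Ref(codes), v)
println(typeof(b))   # a Vector{Int}
```

Applied to a CategoricalArray, the first rule is what would produce the mixed `Union` element type described above.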

For example, the following should ideally return a Vector{Int} (see this thread):

julia> a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
7-element CategoricalArray{String,1,UInt32}:
 "X"
 "X"
 "Y"
 "Z"
 "Y"
 "Y"
 "Z"

julia> replace(a, "X"=>1, "Y"=>2, "Z"=>3)
7-element CategoricalArray{Union{Int64, String},1,UInt32}:
 1
 1
 2
 3
 2
 2
 3
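Until `replace` behaves differently, one possible workaround is to do the lookup by hand. This sketch assumes a CategoricalArrays version that exports `unwrap`; the `mapping` dictionary is an illustrative name.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
mapping = Dict("X" => 1, "Y" => 2, "Z" => 3)

# unwrap returns the String wrapped by each CategoricalValue, so the Dict
# lookup yields plain Ints and the comprehension builds a Vector{Int}.
codes = [mapping[unwrap(x)] for x in a]
```

A missing key raises a `KeyError` here, so an incomplete mapping fails loudly instead of leaving some strings in place.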

Cc: @bkamins

@bkamins
Member

bkamins commented May 14, 2021

I was thinking about it when I answered on SO and concluded that the design of replace was intentional because the contract for replace is:

> Return a copy of collection A where ...

This implies that the container type of the returned value should be the same as if we made a copy (optionally with type promotion).

I think that the crucial point of replace is that normally it is assumed that only some of the values in the source are replaced.

The operation requested on SO assumes we are mapping all values. I think it is a valid use case, but intuitively I would expect a different function for it (one that would, in particular, check that a full mapping specification is provided). But maybe that would not be very useful. Not sure. Maybe it would be enough to add a kwarg to replace (and also recode) allowing the user to decide whether the result should be categorical or not?
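A separate "map every value" function of the kind suggested here could look roughly like this. The name `mapall` is hypothetical; nothing like it exists in Base or CategoricalArrays, and the sketch works on plain arrays (elements of a CategoricalArray may need `unwrap` before the lookup).

```julia
# Hypothetical helper, not part of Base or CategoricalArrays.
function mapall(A::AbstractArray, pairs::Pair...)
    mapping = Dict(pairs...)
    # Refuse partial specifications: every distinct source value needs a target.
    unmapped = [x for x in unique(A) if !haskey(mapping, x)]
    isempty(unmapped) ||
        throw(ArgumentError("no mapping given for: " * join(repr.(unmapped), ", ")))
    # The comprehension takes its element type from the target values.
    return [mapping[x] for x in A]
end

full = mapall(["X", "Y", "X"], "X" => 1, "Y" => 2)
```

With a partial specification such as `mapall(["X", "Z"], "X" => 1)`, this sketch throws an `ArgumentError` rather than returning a mixed-type result.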

@nalimilan
Member Author

The part of the docstring (which I wrote IIRC!) that I was worried about is:

> The element type of the result is chosen using promotion (see promote_type) based on the element type of A and on the types of the new values in pairs.

Clearly we don't respect this currently. But it's not easy to respect.

> Maybe it would be enough to add a kwarg to replace (and also recode) allowing user to decide if the result should be categorical or not?

That wouldn't fix the inconsistency with the docstring, but yeah that could be useful. In particular that would allow the reverse choice, i.e. request a CategoricalArray when the input is an Array (this is common when you have integer-coded inputs that you want to turn into explicit categories).
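Without such a kwarg, the reverse direction currently takes two explicit steps. A minimal sketch, with illustrative codes and labels:

```julia
using CategoricalArrays

codes = [1, 3, 2, 1, 3]
labels = Dict(1 => "low", 2 => "mid", 3 => "high")

# Map the integer codes to labels first, then wrap the result explicitly.
ca = categorical([labels[c] for c in codes])
```

The proposed kwarg would let the user state this intent in a single call instead.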

@bkamins
Member

bkamins commented May 16, 2021

I agree that meeting the contract is hard, which is why I thought of adding a kwarg that would let the user choose among the possible intentions. Having a kwarg would also make it easier for users to notice that there is a choice (and an issue, as I think many users could overlook the problem).
