Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the return dtype of boolean transformer #247

Closed
wants to merge 1 commit into from

Conversation

katxiao
Copy link
Contributor

@katxiao katxiao commented Sep 23, 2021

In the BooleanTransformer, we always cast the reverse transformed data to type object, because if a boolean series contains NaNs, it will be type object. However, if the input is a boolean series without NaNs (therefore type bool), the output type will be inconsistent.

We could instead try to first cast the output type to bool, and if there is a ValueError due to the presence of NaNs, we could cast it to object. The alternative is to first check if the series has any null values, but we have previously seen that checking for nulls is not very performant.

@katxiao katxiao requested a review from csala September 23, 2021 23:53
@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2021

Codecov Report

Merging #247 (e07e40d) into master (54e8a0c) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
+ Coverage   94.92%   94.95%   +0.03%     
==========================================
  Files           9        9              
  Lines         571      575       +4     
==========================================
+ Hits          542      546       +4     
  Misses         29       29              
Impacted Files Coverage Δ
rdt/transformers/boolean.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54e8a0c...e07e40d. Read the comment docs.

@csala
Copy link
Contributor

csala commented Sep 24, 2021

Good catch @katxiao
This may create conflicts with the changes coming in #228 , so I would wait for that to be merged first and then see if we need to add this on top of it.

On the other hand, I think that the implemented approach is fine but it may also produce inconsistent behaviors: One may have fitted the transformer with nulls (so, passing object as input) but later on be reverse transforming data that does not contain nulls. In this case, I would rather expect the output to be object, like what was seen in fit.

Which actually brings us to a new question: Rather than taking care about this here, at the Transformer implementation level, maybe we should update the new BaseTransformer implementation to capture the fit input dtypes and simply cast back to them at the end of reverse_transform. By doing this, we make sure that any transformer always produces the same dtypes that it saw during fit, which are the ones that are truly relevant.

What do you think about it? CCing @amontanez24 @fealho too

@katxiao
Copy link
Contributor Author

katxiao commented Sep 24, 2021

Yeah, I think ideally the dtype should be set to whatever the input data's dtype is. We can either merge this first, or just wait until we make the larger BaseTransformer change (either is fine with me).

@katxiao katxiao closed this Sep 30, 2021
@csala csala deleted the update-boolean-transformer-return-dtype branch October 12, 2021 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants