Extend PSI feature selection to categorical variables #657

Conversation
Hey @dlaprins ping me once this PR is ready for review!
…ies for a certain category
Hey @ClaudioSalvatoreArcidiacono, ready for review. Thanks for the heads-up on category-type variables: the min_pct_empty_bins wasn't handled correctly. Fixed it. I included all your tests, just changed the ordering of the variables (self.variables_ takes all numericals first, then all categoricals).
B = number of bins/categories,
N = size of basis dataset,
M = size of test dataset.
See formula (5.2) from reference.
I know this comes from the previous version of the docstring, but is this reference actually valid? Does it refer to something in the class docstring? If it is referring to outside resources, like the feature-engine documentation, isn't it better to link to it?
This is a hidden method, so it won't be exposed to users. It is mostly for developers. If it is not clear which reference it refers to, then we should spell it out though.
Ok, I see that p_value comes in here, that's why we change the logic of the auto-threshold. We need to update the docstrings.
@glevv could you review this bit please?
The reference is to reference [1] in the docstring of the class (i.e., Yurdakul 2018, https://scholarworks.wmich.edu/dissertations/3208/). I have left it as is; let me know if you want to see the reference spelled out more explicitly.
"See formula (5.2) in [1]" would be fine, I guess.
p_value will be used only when threshold is "auto"? But then why is the threshold called "auto" if the user still needs to set a parameter? We just replaced one with another. I don't have a strong opinion on this one, but it seems to me that it's better to not expose the p_value parameter, or to rethink the parameters of this transformer.
I fixed the reference as per your suggestion.
It is not necessary for the user to set the p-value; it is optional, so a user who just wants to run it without thinking about parameters can still do so with the auto setting. I am of the opinion that being able to set your own p-value would be a very useful feature, especially if one has to justify such values to stakeholders. The functionality of the 'auto' threshold versus a float threshold is not identical: the former varies with the number of categories, whereas the latter is identical for all features.
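For concreteness, a minimal sketch of the behaviour discussed in this thread, assuming the quantile is derived as q = 1 - p_value; formula (5.2) is the q-quantile of a chi-square distribution with B - 1 degrees of freedom, scaled by (1/N + 1/M). The standalone function name and its default are illustrative, not the final API:

from scipy import stats

def auto_threshold(N, M, bins, p_value=0.001):
    # q-quantile of chi2 with bins - 1 degrees of freedom, scaled by
    # (1/N + 1/M); formula (5.2) in Yurdakul (2018)
    # p_value=0.001 reproduces the previously hard-coded q = 0.999,
    # so threshold="auto" stays backwards compatible by default
    q = 1 - p_value
    return stats.chi2.ppf(q, bins - 1) * (1.0 / N + 1.0 / M)

With this shape, threshold="auto" needs no input from the user, while a stakeholder-mandated p-value can still be passed explicitly.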
# taking q = 0.999 to get higher threshold
return stats.chi2.ppf(q, self.bins - 1) * (1.0 / N + 1.0 / M)

def _calculate_auto_threshold(self, N, M, bins):
    """Threshold computation for chi-square test.
could you also write a docstring for the arguments?
Why are we changing the default parameters of the auto-threshold? We used to default q to 0.999 and now this is an obligatory parameter.
What's the logic?
fyi @glevv
Wrote some form of docstring for the arguments, but not sure if it's in the format desired. Please let me know if you want it to be something else.
The logic behind changing the arguments is that the auto-threshold is called by the class, but the p-value (or quantile, as previously used) could not be defined by the user. I would deem this a necessary option in order to make use of the auto-threshold (which I think is very useful in and of itself).
The bins argument needs to be added for auto-threshold to be compatible with categorical features. For details, please see my discussion with Claudio in the Issues thread.
Hey, thanks for the details. I went over the discussion in the issue. To summarize:
- if the user sets 10 bins for the numerical variables, bins=10
- for categorical variables, bins varies. Some may be 3, some may be 5, some may be huge (imagine variables like cities, or postcodes)
Currently, we are still calculating one threshold for all variables. The idea you propose is to have tailored thresholds, do I get this right? The same threshold for all numerical variables (they have the same bins) and a varying threshold for categorical variables.
Sounds good to me. If p_value has a default, then it is backwards compatible, but the user can also change it.
Having said this, the current code calculates one threshold for all variables; we need to change this.
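A sketch of the tailored-threshold idea summarized here, building on the auto_threshold helper sketched earlier (the function and variable names are assumptions, not the final implementation):

def tailored_thresholds(basis_df, num_vars, cat_vars, bins, N, M, p_value=0.001):
    thresholds = {}
    # one threshold shared by all numerical variables (same bin count)
    num_threshold = auto_threshold(N, M, bins, p_value)
    for feature in num_vars:
        thresholds[feature] = num_threshold
    # per-feature thresholds for categoricals: bins = number of categories
    for feature in cat_vars:
        thresholds[feature] = auto_threshold(N, M, basis_df[feature].nunique(), p_value)
    return thresholds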
Coming back to this issue: You are right that we need n_bins as a parameter of autothreshold if we test categoricals because the bins vary.
The question still remains: should we expose p_values?
My thinking: most users will not be familiar with this implementation of the auto-threshold calculation, and they probably won't know what a good value for p_value is. So I wonder how much value exposing this parameter adds, as per @glevv's suggestion.
What are your thoughts?
Hey @dlaprins and @ClaudioSalvatoreArcidiacono
Thank you so much for the active discussions and code changes!
I added my comments on the back of @ClaudioSalvatoreArcidiacono's.
@dlaprins will you be able to update the code implementation?
Thanks a lot!
Thank you both @ClaudioSalvatoreArcidiacono @solegalli for your helpful suggestions and directions. I have tried to fix all the shortcomings you pointed out in the comments in my commit. Please let me know which issues remain unresolved and I'll solve those as well.
Hey @dlaprins
Thanks for the code changes. I went over the code and discussion, here are some more thoughts:
- we are now not capturing numerical and categorical variables in new attributes, that's great. Later on in the logic, we are still looping over self.num_variables_ even though it does not exist any more. Could you please fix?
- It is not necessary to loop over self.variables_; we can remove the nested looping.
- n_bins is being reassigned for each feature; this is not necessary for numerical variables, only for categorical ones. Could we update?
- the threshold is calculated once for all variables; I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?
- The question still remains: should we expose p_values?
- it would be great if we could add the threshold calculation formula in the init docstrings as well.
- @glevv mentioned that the calculation would not work for this example: the current implementation will give erroneous results if there are some differences in categories in the same columns (cat_1_train has categories 'A', 'B' and 'C', and cat_1_test has categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow, or raises an exception to alert users on what to do? (See the sketch after this comment.)
Thank you for the great work!
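To make the category-mismatch example concrete, a minimal, self-contained sketch of the aligned-frequency logic (illustrative code, not the actual _observation_frequency_per_bin implementation):

import numpy as np
import pandas as pd

# basis has categories A/B/C while test has A/B/D: "C" is absent from
# test and "D" is absent from basis
basis = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
test = pd.Series(["A"] * 50 + ["B"] * 30 + ["D"] * 20)

# align the frequency tables on the union of categories, filling the
# absent ones with a small constant so the log stays finite
min_pct_empty_bins = 0.0001
basis_freq, test_freq = basis.value_counts(normalize=True).align(
    test.value_counts(normalize=True), fill_value=min_pct_empty_bins
)
psi = np.sum((test_freq - basis_freq) * np.log(test_freq / basis_freq))
assert np.isfinite(psi)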
if self.strategy == "equal_width":
    bucketer = EqualWidthDiscretiser(bins=self.bins)
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)

# Compute the PSI by looping over the features
self.psi_values_ = {}
self.features_to_drop_ = []

for feature in self.variables_:
This one still needs refactoring @dlaprins
if feature in self.num_variables_:
    basis_discrete = bucketer.fit_transform(basis_df[[feature]].dropna())
    test_discrete = bucketer.transform(test_df[[feature]].dropna())
    n_bins = self.bins
We are re-assigning this value at each iteration. For numerical variables it is not necessary, because it is always self.bins.
For categorical variables, n_bins varies, but then we do not calculate the threshold in the loop; we calculate it once, for all variables, as it currently stands.
I agree: if PSI selection is extended to categorical features, then we need to use a list of auto thresholds depending on the number of bins in these features.
@@ -419,6 +434,16 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
self.psi_values_[feature] = np.sum(
    (test_distrib - basis_distrib) * np.log(test_distrib / basis_distrib)
)

# Determine the appropriate threshold
if self.threshold == "auto":
we are using a single threshold for all variables as it stands now, and it will be calculated based on n_bins, which, as it stands, is the number of categories of the last variable examined. This needs fixing.
Hi @solegalli,
Let me try to address all the comments (including the other ones you left) in this one reply:
Hey @dlaprins
Thanks for the code changes. I went over the code and discussion, here are some more thoughts:
* [ ] we are now not capturing numerical and categorical variables in new attributes, that's great. Later on in the logic we are still looping over self.num_variables_ even though it does not exist any more. Could you please fix?
An oversight indeed.
* [ ] It is not necessary to loop over self.variables_ we can remove the nested looping.
I'm a bit confused here. What do you mean by nested looping? As it stands, I believe there is a single for-loop over self.variables and only an if-statement inside of that for-loop, no other for-loops. (For what it's worth, the for-loop over self.variables is necessary due to the dropna(); I tried vectorizing the for-loop before I noticed this and got wrong results.)
* [ ] n_bins is being reassigned for each feature, this is not necessary for numerical, only for categorical. Could we update?
The only way I see to update this is to use 2 loops, one over categoricals and one over numericals. Then n_bins doesn't have to be checked for each feature in the numerical loop. Is that what you have in mind?
* [ ] the threshold is calculated once for all variables, I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?
The threshold calculation is inside the loop over self.variables. Similar to n_bins, we can have a single numerical threshold outside of a for-loop over numericals, while for categoricals the threshold has to be calculated inside of the for-loop.
* [ ] The question still remains: should we expose p_values?
I believe we should. As a user I would find it rather unpleasant if I cannot set my own p-value threshold for a statistical test. It's fine if a user does not know what p_value they should set, as there's a default value for the p-value that they can use.
* [ ] it would be great if we could add the threshold calculation formula in the init docstrings as well.
Will do
* [ ] @glevv mentioned that the calculation would not work for this example: current implementation will give erroneous results if you have some differences in categories in the same columns (cat_1_train have categories 'A', 'B' and 'C', and cat_1_test have categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow? or raises an exception to alert users on what to do?
Is this really the case? I believe this is tested for by Claudio's test (see line 35, 85, 100 of test_drop_high_psi_features.py): "drift_cat_1" is exactly such that basis_test contains only "A" while test_df contains only "B". This is also why I got errors before modifying _observation_frequency_per_bin() when I changed the "drift_cat_1" feature in the test to be a categorical feature. Please correct me if I'm wrong here.
Thank you for the great work!
Thank you for the suggestions on how to improve the code.
I have left the loop over self.variables as it is for now, but if you would prefer to have 2 loops instead (or if I'm just not understanding you properly), let me know and I'll correct it.
Hi @dlaprins, yes, that was what I meant: 2 for loops instead of the for and the ifs. Sorry I wasn't clear.
Thank you for answering all the points!
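For illustration, a sketch of that two-loop structure as it might sit inside fit (so self refers to the transformer; the _psi helper and the self.cat_variables_ attribute are assumptions, while the PSI formula is the one already in the diff above; assumes numpy imported as np):

def _psi(basis_col, test_col, min_pct_empty_bins=0.0001):
    # frequencies aligned on the union of bins/categories, absent ones
    # filled with a small constant so the log stays finite
    basis_distrib, test_distrib = basis_col.value_counts(normalize=True).align(
        test_col.value_counts(normalize=True), fill_value=min_pct_empty_bins
    )
    return np.sum((test_distrib - basis_distrib) * np.log(test_distrib / basis_distrib))

# loop 1, numerical variables: shared bucketer, shared n_bins, one threshold
for feature in self.num_variables_:
    basis_discrete = bucketer.fit_transform(basis_df[[feature]].dropna())
    test_discrete = bucketer.transform(test_df[[feature]].dropna())
    self.psi_values_[feature] = _psi(basis_discrete[feature], test_discrete[feature])

# loop 2, categorical variables: n_bins, and hence the auto threshold,
# varies per feature, so it is computed inside the loop
for feature in self.cat_variables_:
    basis_discrete = basis_df[[feature]].dropna()
    test_discrete = test_df[[feature]].dropna()
    n_bins = basis_discrete[feature].nunique()  # feeds the per-feature threshold
    self.psi_values_[feature] = _psi(basis_discrete[feature], test_discrete[feature])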
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)

# set up the discretizer for numerical variables
if len(self.num_variables_) > 0:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glevv what's your view on this?
* Transform positional argument into keyword argument. From pandas 2.0, any only accepts keyword arguments (ref pandas-dev/pandas#44896)
* Change how reciprocal is computed. I have not fully understood why this solves the problem, but splitting the operation into two lines does not seem to work
* Catch warnings from pandas.to_datetime. pandas.to_datetime now raises a warning when the column cannot be converted
* check_dtype=False in tests of datetime features. Pandas dataframes created from Python integers get int column type int64, but the operation tested returns int32, which caused issues
* Use droplevel before merging. Merging dfs with different column levels has been disallowed (ref pandas-dev/pandas#34862)
* Change expected values for months. I am not sure why this caused an issue, maybe due to type casting?
* Run black
* Run black on tests
* Run isort on _variable_type_checks.py
* Fix datetime_subtraction

Co-authored-by: Claudio Salvatore Arcidiacono <claudio.arcidiacono@mollie.com>
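For reference, minimal illustrations of the two pandas 2.0 changes most relevant above (assumed toy examples, not the actual feature-engine diff):

import pandas as pd

df = pd.DataFrame({"a": [True, False], "b": [False, False]})

# DataFrame.any: the axis can no longer be passed positionally
row_has_true = df.any(axis=1)

# merging frames whose columns have different numbers of levels is
# disallowed; flatten the multi-level columns with droplevel first
grouped = pd.DataFrame({"key": [1, 2], "val": [3.0, 4.0]})
agg = grouped.groupby("key").agg({"val": ["mean"]})  # two column levels
flat = agg.droplevel(1, axis=1).reset_index()        # back to one level
merged = grouped.merge(flat, on="key", suffixes=("", "_mean"))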
…ies for a certain category
closes #655
closes #658