
Extend PSI feature selection to categorical variables #657


Closed · wants to merge 22 commits

Conversation

@dlaprins commented Apr 18, 2023

closes #655
closes #658

@dlaprins dlaprins marked this pull request as draft April 18, 2023 12:56
@dlaprins dlaprins marked this pull request as ready for review April 18, 2023 13:37
@ClaudioSalvatoreArcidiacono (Contributor):

Hey @dlaprins ping me once this PR is ready for review!

@dlaprins (Author):

Hey @ClaudioSalvatoreArcidiacono , ready for review. Thanks for the heads-up on category-type variables: the min_pct_empty_bins wasn't handled correctly. Fixed it.

I included all your tests, just changed the ordering of the variables (self.variables_ takes all numericals first, then all categoricals).

```
B = number of bins/categories,
N = size of basis dataset,
M = size of test dataset.
See formula (5.2) from reference.
```


I know this comes from the previous version of the docstring, but is this reference actually valid? Does it refer to something in the class docstring? If it refers to outside resources, like the feature_engine documentation, isn't it better to refer with a link?

Collaborator:

This is a hidden method, so it won't be exposed to users. It is mostly for developers. If it is not clear which reference it refers to, then we should spell it out though.

Collaborator:

OK, I see that p_value comes in here; that's why we change the logic of the auto-threshold. We need to update the docstrings.

@glevv could you review this bit please?

Author:

The reference is to reference [1] in the docstring of the class (i.e., Yurdakul (2018), https://scholarworks.wmich.edu/dissertations/3208/). I have left it as is; let me know if you want to see the reference spelled out more explicitly.
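In symbols, the threshold the code computes (a reconstruction from `stats.chi2.ppf(q, self.bins - 1) * (1.0 / N + 1.0 / M)`, not a quote from the dissertation): under no population change, the PSI is asymptotically $\left(\tfrac{1}{N}+\tfrac{1}{M}\right)\chi^2_{B-1}$-distributed, so

$$\text{threshold} = \chi^2_{q,\;B-1}\cdot\left(\frac{1}{N}+\frac{1}{M}\right), \qquad q = 1 - p\_\text{value}.$$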

@glevv (Contributor) commented Apr 20, 2023:

"See formula (5.2) in [1]" would be fine, I guess.

p_value will be used only when threshold is "auto"? But then why is the threshold called auto if the user still needs to set a parameter? We just replaced one with another. I don't have a strong opinion on this one, but it seems to me that it's better not to expose the p_value parameter, or to rethink the parameters of this transformer.

Author:

I fixed the reference as per your suggestion.

It is not necessary for the user to set the p-value; it is optional, so a user who just wants to run the transformer without thinking about parameters can still do so with the auto setting. I am of the opinion that being able to set your own p-value would be a very useful feature, especially if one has to justify such values to stakeholders. The functionality of the 'auto' threshold versus a float threshold is not identical: the former varies with the number of categories, whereas the latter is identical for all features.
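For concreteness, a minimal standalone sketch of the auto-threshold computation under discussion (the function name and defaults here are illustrative; the PR's actual method is `_calculate_auto_threshold`):

```python
from scipy import stats

def auto_threshold(N: int, M: int, bins: int, p_value: float = 0.001) -> float:
    """Critical PSI value from the chi-square approximation.

    N : size of the basis dataset.
    M : size of the test dataset.
    bins : number of bins (numerical) or number of categories (categorical).
    p_value : significance level; the chi2 quantile used is q = 1 - p_value,
        so the old hard-coded q = 0.999 corresponds to p_value = 0.001.
    """
    q = 1.0 - p_value
    return stats.chi2.ppf(q, bins - 1) * (1.0 / N + 1.0 / M)
```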

```python
# taking q = 0.999 to get higher threshold
return stats.chi2.ppf(q, self.bins - 1) * (1.0 / N + 1.0 / M)

def _calculate_auto_threshold(self, N, M, bins):
    """Threshold computation for chi-square test.
```


Could you also write a docstring for the arguments?

Collaborator:

Why are we changing the default parameters of the auto-threshold? We used to default q to 0.999 and now this is an obligatory parameter.

What's the logic?

fyi @glevv

Author:

I wrote some form of docstring for the arguments, but I'm not sure if it's in the desired format. Please let me know if you want it to be something else.

The logic behind changing the arguments is that the auto-threshold is called by the class, but the p-value (or quantile, as previously used) could not be defined by the user. I would deem this a necessary option in order to make use of the auto-threshold (which I think is very useful in and of itself).

The bins argument needs to be added for auto-threshold to be compatible with categorical features. For details, please see my discussion with Claudio in the Issues thread.

Collaborator:

Hey, thanks for the details. I went over the discussion in the issue. To summarize:

  • if the user sets 10 bins for the numerical variables, bins=10
  • for categorical variables, bins varies. Some may be 3, some may be 5, some may be huge (imagine a variable like cities, or postcodes)

Currently, we are still calculating 1 threshold for all variables. The idea you propose is to have tailored thresholds, do I get this right? The same threshold for all numerical variables (they have the same bins) and a varying threshold for categorical variables.

Sounds good to me. If p_value is the default, then it is backwards compatible, but the user can also change it.

Having said this, the current code calculates 1 threshold for all variables; we need to change this.
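A quick illustration of why a single shared threshold cannot serve categoricals of different cardinality (toy sizes; only the trend matters, the threshold grows with the number of bins):

```python
from scipy import stats

N = M = 10_000
q = 0.999
for B in (3, 5, 100):  # a small categorical vs. something like cities or postcodes
    threshold = stats.chi2.ppf(q, B - 1) * (1.0 / N + 1.0 / M)
    print(f"B={B}: threshold={threshold:.4f}")
```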

Collaborator:

Coming back to this issue: You are right that we need n_bins as a parameter of autothreshold if we test categoricals because the bins vary.

The question still remains: should we expose p_values?

My thinking: most users will not be familiar with this implementation of the auto-threshold calculation, and they probably won't know what a good value for p_value is. So, I wonder how much more value it adds to expose this parameter, as per @glevv's suggestion.

What are your thoughts?

@solegalli (Collaborator) left a comment:

Hey @dlaprins and @ClaudioSalvatoreArcidiacono

Thank you so much for the active discussions and code changes!

I added my comments at the back of @ClaudioSalvatoreArcidiacono's.

@dlaprins will you be able to update the code implementation?

Thanks a lot!


@dlaprins (Author):

Thank you both @ClaudioSalvatoreArcidiacono @solegalli for your helpful suggestions and directions. In my latest commit I have tried to fix all the shortcomings you pointed out in the comments. Please let me know which issues remain unresolved and I'll solve those as well.

@solegalli (Collaborator) left a comment:

Hey @dlaprins

Thanks for the code changes. I went over the code and discussion, here are some more thoughts:

  • we no longer capture numerical and categorical variables in new attributes, that's great. Later on in the logic, though, we still loop over self.num_variables_ even though it no longer exists. Could you please fix?
  • It is not necessary to loop over self.variables_; we can remove the nested looping.
  • n_bins is being reassigned for each feature; this is only necessary for categorical features, not numerical ones. Could we update?
  • the threshold is calculated once for all variables; I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?
  • The question still remains: should we expose p_values?
  • it would be great if we could add the threshold calculation formula in the init docstrings as well.
  • @glevv mentioned that the calculation would not work for this example: the current implementation will give erroneous results if there are some differences in categories in the same columns (cat_1_train has categories 'A', 'B' and 'C', and cat_1_test has categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow? Or raise an exception to alert users on what to do? (See the sketch after this comment.)

Thank you for the great work!
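For the last bullet above, a sketch of the category-alignment idea (hypothetical helper; the PR's actual `_observation_frequency_per_bin` may differ, and the small floor value stands in for `min_pct_empty_bins`):

```python
import pandas as pd

def observed_frequencies(basis: pd.Series, test: pd.Series, min_pct: float = 0.0001):
    # Align both splits on the union of observed categories, so a category seen
    # in only one split ('C' vs 'D' in the example above) gets a small non-zero
    # frequency instead of producing inf/nan in the PSI log-ratio.
    basis_freq = basis.value_counts(normalize=True)
    test_freq = test.value_counts(normalize=True)
    categories = basis_freq.index.union(test_freq.index)
    return (
        basis_freq.reindex(categories, fill_value=0.0).clip(lower=min_pct),
        test_freq.reindex(categories, fill_value=0.0).clip(lower=min_pct),
    )
```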

```python
if self.strategy == "equal_width":
    bucketer = EqualWidthDiscretiser(bins=self.bins)
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)

# Compute the PSI by looping over the features
self.psi_values_ = {}
self.features_to_drop_ = []

for feature in self.variables_:
```
Collaborator:

This one still needs refactoring @dlaprins

```python
if feature in self.num_variables_:
    basis_discrete = bucketer.fit_transform(basis_df[[feature]].dropna())
    test_discrete = bucketer.transform(test_df[[feature]].dropna())
    n_bins = self.bins
```
Collaborator:

We are re-assigning this value at each iteration. For numerical variables this is not necessary, because it is always self.bins.

For categorical variables, n_bins varies, but then we do not calculate the threshold in the loop; we calculate it once for all variables as it currently stands.

Contributor:

I agree: if PSI selection is extended to categorical features, then we need to use a list of auto thresholds depending on the number of bins in these features.

```
@@ -419,6 +434,16 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
    self.psi_values_[feature] = np.sum(
        (test_distrib - basis_distrib) * np.log(test_distrib / basis_distrib)
    )

    # Determine the appropriate threshold
    if self.threshold == "auto":
```
Collaborator:

As it stands, we are using a single threshold for all variables, and it is calculated based on n_bins, which at that point is the number of categories of the last variable examined. This needs fixing.

Author:

Hi @solegalli,

Let me try to address all the comments (including the other ones you left) in this one reply:

> Hey @dlaprins
>
> Thanks for the code changes. I went over the code and discussion, here are some more thoughts:
>
> * [ ] we no longer capture numerical and categorical variables in new attributes, that's great. Later on in the logic, though, we still loop over self.num_variables_ even though it no longer exists. Could you please fix?

An oversight indeed.

> * [ ] It is not necessary to loop over self.variables_; we can remove the nested looping.

I'm a bit confused here. What do you mean by nested looping? As it stands, I believe there is a single for-loop over self.variables_ and only an if-statement inside of that for-loop, no other for-loops. (For what it's worth: the for-loop over self.variables_ is necessary due to the dropna(); I tried vectorizing the for-loop before I noticed this and got wrong results.)

> * [ ] n_bins is being reassigned for each feature; this is only necessary for categorical features, not numerical ones. Could we update?

The only way I see to update this is to use 2 loops, one over categoricals and one over numericals. Then n_bins doesn't have to be checked for each feature in the numerical loop. Is that what you have in mind?

> * [ ] the threshold is calculated once for all variables; I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?

The threshold calculation is inside the loop over self.variables_. Similar to n_bins, we can have a single numerical threshold outside of a for-loop over numericals, while for categoricals the threshold has to be calculated inside of the for-loop.

> * [ ] The question still remains: should we expose p_values?

I believe we should. As a user I would find it rather unpleasant if I could not set my own p-value threshold for a statistical test. It's fine if a user does not know what p_value they should set, as there's a default value that they can use.

> * [ ] it would be great if we could add the threshold calculation formula in the init docstrings as well.

Will do.

> * [ ] @glevv mentioned that the calculation would not work for this example: the current implementation will give erroneous results if there are some differences in categories in the same columns (cat_1_train has categories 'A', 'B' and 'C', and cat_1_test has categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow? Or raise an exception to alert users on what to do?

Is this really the case? I believe this is tested for by Claudio's test (see lines 35, 85, 100 of test_drop_high_psi_features.py): "drift_cat_1" is exactly such that basis_test contains only "A" while test_df contains only "B". This is also why I got errors before modifying _observation_frequency_per_bin() when I changed the "drift_cat_1" feature in the test to be a categorical feature. Please correct me if I'm wrong here.

> Thank you for the great work!

Thank you for the suggestions on how to improve the code.

I have left the loop over self.variables_ as it is for now, but if you would prefer to have 2 loops instead (or if I'm just not understanding you properly), let me know and I'll correct it.
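For reference, the two-loop split being discussed could look roughly like this standalone helper (names such as `num_vars`/`cat_vars` are illustrative, not the PR's attributes):

```python
import pandas as pd
from scipy import stats

def per_feature_thresholds(basis_df, test_df, num_vars, cat_vars, bins, p_value=0.001):
    """One shared threshold for numerical features (fixed number of bins) and
    a tailored threshold per categorical feature (bins = observed categories)."""
    q = 1.0 - p_value
    scale = 1.0 / len(basis_df) + 1.0 / len(test_df)

    thresholds = {}
    threshold_num = stats.chi2.ppf(q, bins - 1) * scale  # computed once
    for var in num_vars:
        thresholds[var] = threshold_num
    for var in cat_vars:
        # for categoricals, the "bin" count is the number of observed categories
        n_cats = pd.concat([basis_df[var], test_df[var]]).nunique()
        thresholds[var] = stats.chi2.ppf(q, n_cats - 1) * scale
    return thresholds
```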

Collaborator:

Hi @dlaprins, yes, that is what I meant: 2 for-loops instead of the for and the ifs. Sorry I wasn't clear.

Thank you for answering all the points!


```python
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)
# set up the discretizer for numerical variables
if len(self.num_variables_) > 0:
```
Collaborator:

@glevv what's your view on this?


dlaprins and others added 2 commits April 21, 2023 11:59
* Transform positional argument into keyword argument

From pandas 2.0, `any` only accepts keyword arguments
ref pandas-dev/pandas#44896

* Change how reciprocal is computed

I have not fully understood why this solves the problem, but splitting
the operation into 2 lines does not seem to work

* Catch warnings from pandas.to_datetime

Now pandas.to_datetime raises a warning when the column cannot be converted

* check_dtype=False in tests datetime features

Pandas DataFrames created from Python integers get `int64` column
types, but the operation tested returns `int32`, which caused issues

* Use droplevel before merging

Merging dfs with different column levels has been disallowed
ref pandas-dev/pandas#34862

* Change expected values for months

I am not sure why this caused an issue, maybe due to type casting?

* run black

* run black on tests

* isort _variable_type_checks.py

* Fix datetime_subtraction

---------

Co-authored-by: Claudio Salvatore Arcidiacono <claudio.arcidiacono@mollie.com>
@solegalli (Collaborator):

FYI @dlaprins

Please rebase main, we just merged #660 :)

@dlaprins (Author):

> FYI @dlaprins
>
> Please rebase main, we just merged #660 :)

Fixed. I also split the loop over variables into 2 loops, over categorical and numerical variables separately, as requested.
