
Extend PSI feature selection to categorical variables #657


Closed · wants to merge 22 commits

Conversation

@dlaprins commented Apr 18, 2023

closes #655
closes #658

@dlaprins dlaprins marked this pull request as draft April 18, 2023 12:56
@dlaprins dlaprins marked this pull request as ready for review April 18, 2023 13:37
@ClaudioSalvatoreArcidiacono (Contributor):

Hey @dlaprins ping me once this PR is ready for review!

@dlaprins (Author):

Hey @ClaudioSalvatoreArcidiacono , ready for review. Thanks for the heads-up on category-type variables: the min_pct_empty_bins wasn't handled correctly. Fixed it.

I included all your tests, just changed the ordering of the variables (self.variables_ takes all numericals first, then all categoricals).

```
B = number of bins/categories,
N = size of basis dataset,
M = size of test dataset.
See formula (5.2) from reference.
```


I know this comes from the previous version of the docstring, but is this reference actually valid? Does it refer to something in the class docstring? If it refers to outside resources, like the feature_engine documentation, isn't it better to refer with a link?

Collaborator:

This is a hidden method, so it won't be exposed to users. It is mostly for developers. If it is not clear which reference it refers to, then we should spell it out though.

Collaborator:

OK, I see that p_value comes in here; that's why we change the logic of the auto-threshold. We need to update the docstrings.

@glevv could you review this bit please?

Author:

The reference is to reference [1] in the docstring of the class (i.e., Yurdakul (2018), https://scholarworks.wmich.edu/dissertations/3208/). I have left it as is; let me know if you want to see the reference spelled out more explicitly.
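In symbols, the threshold the code computes (a reconstruction from `stats.chi2.ppf(q, self.bins - 1) * (1.0 / N + 1.0 / M)`, not a quote from the dissertation): under no population change, the PSI is asymptotically $\left(\tfrac{1}{N}+\tfrac{1}{M}\right)\chi^2_{B-1}$-distributed, so

$$\text{threshold} = \chi^2_{q,\;B-1}\cdot\left(\frac{1}{N}+\frac{1}{M}\right), \qquad q = 1 - p\_\text{value}.$$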

@glevv (Contributor) commented Apr 20, 2023:

"See formula (5.2) in [1]" would be fine, I guess.

p_value will be used only when threshold is "auto"? But then why is the threshold called auto if the user still needs to set a parameter? We just replaced one with another. I don't have a strong opinion on this one, but it seems to me that it's better not to expose the p_value parameter, or to rethink the parameters of this transformer.

Author:

I fixed the reference as per your suggestion.

It is not necessary for the user to set the p-value; it is optional, so a user who just wants to run the transformer without thinking about parameters can still do so with the auto setting. I am of the opinion that being able to set your own p-value would be a very useful feature, especially if one has to justify such values to stakeholders. The functionality of the 'auto' threshold versus a float threshold is not identical: the former varies with the number of categories, whereas the latter is identical for all features.
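For concreteness, a minimal standalone sketch of the auto-threshold computation under discussion (the function name and defaults here are illustrative; the PR's actual method is `_calculate_auto_threshold`):

```python
from scipy import stats

def auto_threshold(N: int, M: int, bins: int, p_value: float = 0.001) -> float:
    """Critical PSI value from the chi-square approximation.

    N : size of the basis dataset.
    M : size of the test dataset.
    bins : number of bins (numerical) or number of categories (categorical).
    p_value : significance level; the chi2 quantile used is q = 1 - p_value,
        so the old hard-coded q = 0.999 corresponds to p_value = 0.001.
    """
    q = 1.0 - p_value
    return stats.chi2.ppf(q, bins - 1) * (1.0 / N + 1.0 / M)
```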

```python
# taking q = 0.999 to get higher threshold
return stats.chi2.ppf(q, self.bins - 1) * (1.0 / N + 1.0 / M)

def _calculate_auto_threshold(self, N, M, bins):
    """Threshold computation for chi-square test.
```


Could you also write a docstring for the arguments?

Collaborator:

Why are we changing the default parameters of the auto-threshold? We used to default q to 0.999 and now this is an obligatory parameter.

What's the logic?

fyi @glevv

Author:

I wrote some form of docstring for the arguments, but I'm not sure if it's in the desired format. Please let me know if you want it to be something else.

The logic behind changing the arguments is that the auto-threshold is called by the class, but the p-value (or quantile, as previously used) could not be defined by the user. I would deem this a necessary option in order to make use of the auto-threshold (which I think is very useful in and of itself).

The bins argument needs to be added for auto-threshold to be compatible with categorical features. For details, please see my discussion with Claudio in the Issues thread.

Collaborator:

Hey, thanks for the details. I went over the discussion in the issue. To summarize:

  • if the user sets 10 bins for the numerical variables, bins=10
  • for categorical variables, bins varies. Some may be 3, some may be 5, some may be huge (imagine a variable like cities, or postcodes)

Currently, we are still calculating 1 threshold for all variables. The idea you propose is to have tailored thresholds, do I get this right? The same threshold for all numerical variables (they have the same bins) and a varying threshold for categorical variables.

Sounds good to me. If p_value is the default, then it is backwards compatible, but the user can also change it.

Having said this, the current code calculates 1 threshold for all variables; we need to change this.
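A quick illustration of why a single shared threshold cannot serve categoricals of different cardinality (toy sizes; only the trend matters, the threshold grows with the number of bins):

```python
from scipy import stats

N = M = 10_000
q = 0.999
for B in (3, 5, 100):  # a small categorical vs. something like cities or postcodes
    threshold = stats.chi2.ppf(q, B - 1) * (1.0 / N + 1.0 / M)
    print(f"B={B}: threshold={threshold:.4f}")
```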

Collaborator:

Coming back to this issue: You are right that we need n_bins as a parameter of autothreshold if we test categoricals because the bins vary.

The question still remains: should we expose p_values?

My thinking: most users will not be familiar with this implementation of the auto-threshold calculation, and they probably won't know what a good value for p_value is. So, I wonder how much more value it adds to expose this parameter, as per @glevv's suggestion.

What are your thoughts?

@solegalli (Collaborator) left a comment:

Hey @dlaprins and @ClaudioSalvatoreArcidiacono

Thank you so much for the active discussions and code changes!

I added my comments at the back of @ClaudioSalvatoreArcidiacono's.

@dlaprins will you be able to update the code implementation?

Thanks a lot!


@dlaprins (Author):

Thank you both @ClaudioSalvatoreArcidiacono @solegalli for your helpful suggestions and directions. In my latest commit I have tried to fix all the shortcomings you pointed out in the comments. Please let me know which issues remain unresolved and I'll solve those as well.

@solegalli (Collaborator) left a comment:

Hey @dlaprins

Thanks for the code changes. I went over the code and discussion, here are some more thoughts:

  • we no longer capture numerical and categorical variables in new attributes, that's great. Later on in the logic, though, we still loop over self.num_variables_ even though it no longer exists. Could you please fix?
  • It is not necessary to loop over self.variables_; we can remove the nested looping.
  • n_bins is being reassigned for each feature; this is only necessary for categorical features, not numerical ones. Could we update?
  • the threshold is calculated once for all variables; I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?
  • The question still remains: should we expose p_values?
  • it would be great if we could add the threshold calculation formula in the init docstrings as well.
  • @glevv mentioned that the calculation would not work for this example: the current implementation will give erroneous results if there are some differences in categories in the same columns (cat_1_train has categories 'A', 'B' and 'C', and cat_1_test has categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow? Or raise an exception to alert users on what to do? (See the sketch after this comment.)

Thank you for the great work!
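For the last bullet above, a sketch of the category-alignment idea (hypothetical helper; the PR's actual `_observation_frequency_per_bin` may differ, and the small floor value stands in for `min_pct_empty_bins`):

```python
import pandas as pd

def observed_frequencies(basis: pd.Series, test: pd.Series, min_pct: float = 0.0001):
    # Align both splits on the union of observed categories, so a category seen
    # in only one split ('C' vs 'D' in the example above) gets a small non-zero
    # frequency instead of producing inf/nan in the PSI log-ratio.
    basis_freq = basis.value_counts(normalize=True)
    test_freq = test.value_counts(normalize=True)
    categories = basis_freq.index.union(test_freq.index)
    return (
        basis_freq.reindex(categories, fill_value=0.0).clip(lower=min_pct),
        test_freq.reindex(categories, fill_value=0.0).clip(lower=min_pct),
    )
```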

```python
if self.strategy == "equal_width":
    bucketer = EqualWidthDiscretiser(bins=self.bins)
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)

# Compute the PSI by looping over the features
self.psi_values_ = {}
self.features_to_drop_ = []

for feature in self.variables_:
```
Collaborator:

This one still needs refactoring @dlaprins

```python
if feature in self.num_variables_:
    basis_discrete = bucketer.fit_transform(basis_df[[feature]].dropna())
    test_discrete = bucketer.transform(test_df[[feature]].dropna())
    n_bins = self.bins
```
Collaborator:

We are re-assigning this value at each iteration. For numerical variables this is not necessary, because it is always self.bins.

For categorical variables, n_bins varies, but then we do not calculate the threshold in the loop; we calculate it once for all variables as it currently stands.

Contributor:

I agree: if PSI selection is extended to categorical features, then we need to use a list of auto thresholds depending on the number of bins in these features.

```
@@ -419,6 +434,16 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
    self.psi_values_[feature] = np.sum(
        (test_distrib - basis_distrib) * np.log(test_distrib / basis_distrib)
    )

    # Determine the appropriate threshold
    if self.threshold == "auto":
```
Collaborator:

As it stands, we are using a single threshold for all variables, and it is calculated based on n_bins, which at that point is the number of categories of the last variable examined. This needs fixing.

Author:

Hi @solegalli,

Let me try to address all the comments (including the other ones you left) in this one reply:

> Hey @dlaprins
>
> Thanks for the code changes. I went over the code and discussion, here are some more thoughts:
>
> * [ ] we no longer capture numerical and categorical variables in new attributes, that's great. Later on in the logic, though, we still loop over self.num_variables_ even though it no longer exists. Could you please fix?

An oversight indeed.

> * [ ] It is not necessary to loop over self.variables_; we can remove the nested looping.

I'm a bit confused here. What do you mean by nested looping? As it stands, I believe there is a single for-loop over self.variables_ and only an if-statement inside of that for-loop, no other for-loops. (For what it's worth: the for-loop over self.variables_ is necessary due to the dropna(); I tried vectorizing the for-loop before I noticed this and got wrong results.)

> * [ ] n_bins is being reassigned for each feature; this is only necessary for categorical features, not numerical ones. Could we update?

The only way I see to update this is to use 2 loops, one over categoricals and one over numericals. Then n_bins doesn't have to be checked for each feature in the numerical loop. Is that what you have in mind?

> * [ ] the threshold is calculated once for all variables; I think the idea is to use the same threshold for numerical, and a changing threshold for categorical. Could we reflect this in the code?

The threshold calculation is inside the loop over self.variables_. Similar to n_bins, we can have a single numerical threshold outside of a for-loop over numericals, while for categoricals the threshold has to be calculated inside of the for-loop.

> * [ ] The question still remains: should we expose p_values?

I believe we should. As a user I would find it rather unpleasant if I could not set my own p-value threshold for a statistical test. It's fine if a user does not know what p_value they should set, as there's a default value that they can use.

> * [ ] it would be great if we could add the threshold calculation formula in the init docstrings as well.

Will do.

> * [ ] @glevv mentioned that the calculation would not work for this example: the current implementation will give erroneous results if there are some differences in categories in the same columns (cat_1_train has categories 'A', 'B' and 'C', and cat_1_test has categories 'A', 'B' and 'D'). Could we add that as a test and make sure it works somehow? Or raise an exception to alert users on what to do?

Is this really the case? I believe this is tested for by Claudio's test (see lines 35, 85, 100 of test_drop_high_psi_features.py): "drift_cat_1" is exactly such that basis_test contains only "A" while test_df contains only "B". This is also why I got errors before modifying _observation_frequency_per_bin() when I changed the "drift_cat_1" feature in the test to be a categorical feature. Please correct me if I'm wrong here.

> Thank you for the great work!

Thank you for the suggestions on how to improve the code.

I have left the loop over self.variables_ as it is for now, but if you would prefer to have 2 loops instead (or if I'm just not understanding you properly), let me know and I'll correct it.
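For reference, the two-loop split being discussed could look roughly like this standalone helper (names such as `num_vars`/`cat_vars` are illustrative, not the PR's attributes):

```python
import pandas as pd
from scipy import stats

def per_feature_thresholds(basis_df, test_df, num_vars, cat_vars, bins, p_value=0.001):
    """One shared threshold for numerical features (fixed number of bins) and
    a tailored threshold per categorical feature (bins = observed categories)."""
    q = 1.0 - p_value
    scale = 1.0 / len(basis_df) + 1.0 / len(test_df)

    thresholds = {}
    threshold_num = stats.chi2.ppf(q, bins - 1) * scale  # computed once
    for var in num_vars:
        thresholds[var] = threshold_num
    for var in cat_vars:
        # for categoricals, the "bin" count is the number of observed categories
        n_cats = pd.concat([basis_df[var], test_df[var]]).nunique()
        thresholds[var] = stats.chi2.ppf(q, n_cats - 1) * scale
    return thresholds
```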

Collaborator:

Hi @dlaprins, yes, that is what I meant: 2 for-loops instead of the for and the ifs. Sorry I wasn't clear.

Thank you for answering all the points!


```python
else:
    bucketer = EqualFrequencyDiscretiser(q=self.bins)
# set up the discretizer for numerical variables
if len(self.num_variables_) > 0:
```
Collaborator:

@glevv what's your view on this?


dlaprins and others added 2 commits April 21, 2023 11:59
* Transform positional argument into keyword argument

From pandas 2.0, `any` only accepts keyword arguments
ref pandas-dev/pandas#44896

* Change how reciprocal is computed

I have not fully understood why this solves the problem, but splitting
the operation into 2 lines does not seem to work

* Catch warnings from pandas.to_datetime

Now pandas.to_datetime raises a warning when the column cannot be converted

* check_dtype=False in tests datetime features

Pandas DataFrames created from Python integers get `int64` column
types, but the operation tested returns `int32`, which caused issues

* Use droplevel before merging

Merging dfs with different column levels has been disallowed
ref pandas-dev/pandas#34862

* Change expected values for months

I am not sure why this caused an issue, maybe due to type casting?

* run black

* run black on tests

* isort _variable_type_checks.py

* Fix datetime_subtraction

---------

Co-authored-by: Claudio Salvatore Arcidiacono <claudio.arcidiacono@mollie.com>
@solegalli (Collaborator):

FYI @dlaprins

Please rebase main, we just merged #660 :)

@dlaprins (Author):

> FYI @dlaprins
>
> Please rebase main, we just merged #660 :)

Fixed. I also split the loop over variables into 2 loops, over categorical and numerical variables separately, as requested.
