[BUG] IHT always checks the probability of the first class to make the selection #848

Closed
solegalli opened this issue Aug 3, 2021 · 1 comment · Fixed by #1013

@solegalli
Contributor

solegalli commented Aug 3, 2021

Describe the bug

cross_val_predict returns an array with the probabilities of each class. The array has as many columns as the target has classes.

Then the code takes the first vector of probabilities, that is, the probabilities of the first majority class, and selects the samples to retain based on that vector.

This is OK in binary classification, but in multiclass, the samples should be filtered out or retained based on their own class probability.
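As a minimal sketch of the distinction described above, the toy example below contrasts reading a fixed column with reading each sample's own class column. The probability values and labels are made up for illustration, and the labels are assumed to be integers that match the column order.

```python
import numpy as np

# Toy predict_proba-style output for 4 samples and 3 classes (made-up values).
probabilities = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.2, 0.6],
        [0.3, 0.6, 0.1],
    ]
)
# Assumption: integer labels that double as column indices.
y = np.array([0, 1, 2, 1])

# Reading column 0 scores every sample by P(class 0), whatever its label is.
first_class_probs = probabilities[:, 0]

# Reading column y[i] in row i scores each sample by the probability
# of its own class.
own_class_probs = probabilities[np.arange(len(y)), y]
```

In the binary case the two columns are complementary, so ranking by one column still carries the information of the other. With three or more classes, a low value in column 0 says nothing about how confidently a sample of class 2 is classified.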

Expected Results

probabilities = cross_val_predict(
    self.estimator_,
    X,
    y,
    cv=skf,
    n_jobs=self.n_jobs,
    method="predict_proba",
)

idx_under = np.empty((0,), dtype=int)

for target_class in np.unique(y):
    if target_class in self.sampling_strategy_.keys():

        probs = probabilities[range(len(y)), target_class]  # <== proposed change

        n_samples = self.sampling_strategy_[target_class]

        threshold = np.percentile(
            probs[y == target_class],
            (1.0 - (n_samples / target_stats[target_class])) * 100.0,
        )
        index_target_class = np.flatnonzero(
            probs[y == target_class] >= threshold
        )
    else:

where target_class is the column in the array that holds the probability of the class being undersampled.
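The per-class selection described above can be sketched in isolation with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: `select_indices_per_class` and `n_samples_per_class` are hypothetical names, and the label-to-column mapping through `np.unique` is added so the sketch also works when labels are not the integers 0..n_classes-1 (it assumes the probability columns follow sorted label order).

```python
import numpy as np


def select_indices_per_class(probabilities, y, n_samples_per_class):
    """Hypothetical helper: keep the highest-probability samples per class."""
    # Map arbitrary labels to column positions (assumes sorted label order).
    classes, y_encoded = np.unique(y, return_inverse=True)
    # Probability that each sample belongs to its own class.
    own_probs = probabilities[np.arange(len(y)), y_encoded]

    selected = []
    for col, target_class in enumerate(classes):
        class_idx = np.flatnonzero(y_encoded == col)
        if target_class in n_samples_per_class:
            n_keep = n_samples_per_class[target_class]
            # Samples below this percentile of their class are dropped.
            threshold = np.percentile(
                own_probs[class_idx],
                (1.0 - n_keep / len(class_idx)) * 100.0,
            )
            class_idx = class_idx[own_probs[class_idx] >= threshold]
        # Classes that are not undersampled are kept whole.
        selected.append(class_idx)
    return np.sort(np.concatenate(selected))


# Made-up data: four samples of class "a", two of class "b".
probabilities = np.array(
    [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2], [0.55, 0.45], [0.3, 0.7], [0.2, 0.8]]
)
y = np.array(["a", "a", "a", "a", "b", "b"])
kept = select_indices_per_class(probabilities, y, {"a": 2})
```

Because the cut is a percentile threshold, tied probabilities can leave slightly more than the requested number of samples in a class.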

In addition, the documentation suggests that IHT supports or implements one-vs-rest for multiclass targets. But the code in its current form does not use one-vs-rest, so it is up to the user to be aware of this.

I suggest we either make this clear in the documentation, or implement one-vs-rest to wrap the estimator passed by the user.
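Until either option lands, a user who wants one-vs-rest behaviour could wrap the estimator themselves. The sketch below uses scikit-learn's `OneVsRestClassifier` around a `LogisticRegression` (an arbitrary choice for illustration) and shows that `cross_val_predict` still returns one probability column per class. The synthetic dataset and all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem, for illustration only.
X, y = make_classification(
    n_samples=150,
    n_features=20,
    n_informative=4,
    n_classes=3,
    random_state=0,
)

# Fit one binary classifier per class instead of a single multiclass model.
ovr_estimator = OneVsRestClassifier(LogisticRegression(max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probabilities = cross_val_predict(
    ovr_estimator,
    X,
    y,
    cv=skf,
    method="predict_proba",
)
# One row per sample, one column per class.
n_rows, n_cols = probabilities.shape
```

A wrapped estimator like `ovr_estimator` is the kind of object that could then be handed to the sampler as its estimator, though whether that gives the behaviour the documentation describes is exactly what this issue questions.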

@glemaitre
Member

> Then the code takes the first vector of probabilities

It takes the probabilities associated with the true label. The idea is to keep the samples with high probabilities, because a high probability means the sample is easy to classify correctly. We then loop over the classes and select samples for each class of interest.

The documentation regarding multiclass support is indeed wrong.
