[BUG] IHT always checks the probability of the first class to make the selection #848

Closed
@solegalli

Description

Describe the bug

cross_val_predict returns an array with the probabilities of each class. The array has as many columns as the target has classes.

The code then takes the first column of probabilities, that is, the probabilities of the first class, and uses that vector to select the samples to retain.

This is OK in binary classification, but in multiclass problems each sample should be filtered out or retained based on the probability of its own class.

Expected Results

probabilities = cross_val_predict(
    self.estimator_,
    X,
    y,
    cv=skf,
    n_jobs=self.n_jobs,
    method="predict_proba",
)

idx_under = np.empty((0,), dtype=int)

for target_class in np.unique(y):
    if target_class in self.sampling_strategy_.keys():
        # <== proposed change: use the column of the class being undersampled
        probs = probabilities[range(len(y)), target_class]

        n_samples = self.sampling_strategy_[target_class]

        threshold = np.percentile(
            probs[y == target_class],
            (1.0 - (n_samples / target_stats[target_class])) * 100.0,
        )
        index_target_class = np.flatnonzero(
            probs[y == target_class] >= threshold
        )
    else:

Here, target_class is the column of the array that holds the probabilities for the class being undersampled.
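To make the difference concrete, here is a minimal standalone sketch of per-class selection outside of imbalanced-learn. The function name `select_by_own_class_probability` and the `n_keep` dict are hypothetical, introduced only for this illustration. The sketch assumes the columns of `probabilities` follow the sorted order of `np.unique(y)`, and it looks up the column by position rather than by label value, so it does not rely on the labels being 0..n-1.

```python
import numpy as np


def select_by_own_class_probability(probabilities, y, n_keep):
    """Keep, per class, the samples with the highest own-class probability.

    probabilities: (n_samples, n_classes) array; columns are assumed to
        follow the sorted order of np.unique(y).
    n_keep: hypothetical dict {class_label: number of samples to retain}.
        Classes missing from the dict are kept in full.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    idx_keep = np.empty((0,), dtype=int)

    for col, (target_class, n_total) in enumerate(zip(classes, counts)):
        in_class = np.flatnonzero(y == target_class)
        if target_class in n_keep:
            # Use the column of this class, not always column 0.
            probs = probabilities[in_class, col]
            threshold = np.percentile(
                probs, (1.0 - n_keep[target_class] / n_total) * 100.0
            )
            in_class = in_class[probs >= threshold]
        idx_keep = np.concatenate((idx_keep, in_class))
    return idx_keep


probabilities = [
    [0.9, 0.05, 0.05],
    [0.6, 0.2, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.8],
    [0.3, 0.3, 0.4],
]
y = [0, 0, 1, 2, 2, 2]
kept = select_by_own_class_probability(probabilities, y, n_keep={2: 2})
```

In this toy data, class 2 is reduced to the two samples with the highest probability in column 2. Thresholding the same samples on column 0 would retain a different pair, which is the multiclass problem described above.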

In addition, the documentation suggests that IHT supports or implements One-vs-Rest for multiclass targets, but the code in its current form does not use One-vs-Rest. It is therefore up to the user to be aware of this.

I suggest we either make this clear in the documentation, or implement One-vs-Rest to wrap the estimator passed by the user.
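As a rough sketch of the second option, the user's estimator could be wrapped in scikit-learn's `OneVsRestClassifier` before computing the out-of-fold probabilities. The helper name `ovr_cross_val_probabilities` and its `n_splits` and `random_state` parameters are hypothetical. This is not how imbalanced-learn does it today, only one way the wrapping might look.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.multiclass import OneVsRestClassifier


def ovr_cross_val_probabilities(estimator, X, y, n_splits=5, random_state=0):
    """Out-of-fold class probabilities from a One-vs-Rest wrapped estimator."""
    # Clone so the caller's estimator is left untouched.
    ovr = OneVsRestClassifier(clone(estimator))
    skf = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=random_state
    )
    # One column per class, one binary estimator fitted per class.
    return cross_val_predict(ovr, X, y, cv=skf, method="predict_proba")


# Small synthetic three-class problem, for illustration only.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 20)
X_demo = rng.normal(size=(60, 4)) + y_demo[:, None]
proba_demo = ovr_cross_val_probabilities(LogisticRegression(), X_demo, y_demo)
```

The resulting array has one column per class, so the per-class thresholding proposed under Expected Results could be applied to it directly.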
