Description
Describe the bug
cross_val_predict returns an array with the predicted probabilities of each class. The array has as many columns as the target has classes.
The code then takes the first vector of probabilities, that is, the probabilities of the first majority class, and selects the samples to retain based on that vector alone.
This is fine for binary classification, but in multiclass problems each sample should be filtered out or retained based on the probability of its own class.
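To illustrate the difference, here is a minimal NumPy sketch. The probability values and labels are made up for the example, and the labels are assumed to be the integers 0, 1, 2 so that a label can be used directly as a column index.

```python
import numpy as np

# Hypothetical out-of-fold probabilities for 4 samples and 3 classes
# (rows sum to 1; columns follow the sorted class labels 0, 1, 2).
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.3, 0.6, 0.1],
])
y = np.array([0, 1, 2, 1])

# One fixed column for every sample: only meaningful for samples of class 0.
first_column = probabilities[:, 0]

# Each sample's probability for its own class (index by row and by label).
own_class = probabilities[np.arange(len(y)), y]

print(first_column)  # [0.7 0.1 0.2 0.3]
print(own_class)     # [0.7 0.8 0.7 0.6]
```

With the fixed column, the samples of classes 1 and 2 get low scores even though the classifier is confident about them, so a threshold on that vector would rank them incorrectly.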
Expected Results
probabilities = cross_val_predict(
    self.estimator_,
    X,
    y,
    cv=skf,
    n_jobs=self.n_jobs,
    method="predict_proba",
)
idx_under = np.empty((0,), dtype=int)
for target_class in np.unique(y):
    if target_class in self.sampling_strategy_.keys():
        probs = probabilities[range(len(y)), target_class]  # <== suggested change
        n_samples = self.sampling_strategy_[target_class]
        threshold = np.percentile(
            probs[y == target_class],
            (1.0 - (n_samples / target_stats[target_class])) * 100.0,
        )
        index_target_class = np.flatnonzero(
            probs[y == target_class] >= threshold
        )
    else:
where target_class selects the column of the probabilities array that corresponds to the class being undersampled.
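Note that using target_class directly as a column index only works when the labels are the integers 0 to n_classes - 1. The predict_proba columns follow the sorted class labels, so a more general version would look up the column position of each label. Below is a standalone sketch of that idea, not the library's code: select_per_class and n_keep are hypothetical names, with n_keep standing in for sampling_strategy_.

```python
import numpy as np

def select_per_class(probabilities, y, n_keep):
    """Return indices of samples to keep, judged by their own class probability.

    n_keep (hypothetical stand-in for sampling_strategy_) maps a class label
    to the number of samples to retain; classes not listed are kept whole.
    """
    # np.unique returns sorted labels, matching the predict_proba column order.
    classes, counts = np.unique(y, return_counts=True)
    kept = []
    for column, (label, count) in enumerate(zip(classes, counts)):
        members = np.flatnonzero(y == label)
        if label in n_keep:
            # Probability of this class, restricted to samples of this class.
            probs = probabilities[members, column]
            threshold = np.percentile(
                probs, (1.0 - n_keep[label] / count) * 100.0
            )
            members = members[probs >= threshold]
        kept.append(members)
    return np.concatenate(kept)

# Made-up example with string labels: keep the 2 easiest samples of class "a".
y = np.array(["a", "a", "a", "a", "b", "b", "c", "c"])
probabilities = np.array([
    [0.9, 0.05, 0.05],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])
kept = select_per_class(probabilities, y, n_keep={"a": 2})
```

Here the two class "a" samples with the highest class "a" probability are retained, and classes "b" and "c" are left untouched.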
In addition, the documentation suggests that IHT supports or implements one-vs-rest for multiclass targets, but the current code does not use one-vs-rest, so it is up to the user to be aware of this.
I suggest we either make this clear in the documentation, or implement one-vs-rest by wrapping the estimator passed by the user.
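For the second option, a rough sketch of what the wrapping could look like from the user's side today, using scikit-learn's OneVsRestClassifier. The dataset, the LogisticRegression base estimator and the fold settings are assumptions chosen for illustration, not what IHT does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.multiclass import OneVsRestClassifier

# Synthetic imbalanced 3-class problem (labels are 0, 1, 2).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Wrap the user's estimator so one binary model is fitted per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probabilities = cross_val_predict(ovr, X, y, cv=skf, method="predict_proba")

# Each sample's out-of-fold probability for its own class.
own_class = probabilities[np.arange(len(y)), y]
```

The own_class vector could then be thresholded per class as in the snippet under Expected Results.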