Description
The code below shows two ways of getting the top-ranked features from the ensemble filter E-mean. Given the same base filters, both should return the same result, but they do not.
library(mlr)
#> Loading required package: ParamHelpers
library(survival)
library(testthat)
library(checkmate)
data(veteran)
vet.task <- makeSurvTask(id = "VET", data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
# Below are two methods for getting the top-ranked features from E-mean
# Both should return the same result, if the base filters are the same
# Get the 5 top-ranked features generated by ensemble filter E-mean,
# by calling the function for that filter directly
set.seed(24601)
filt = mlr:::.FilterEnsembleRegister[["E-mean"]]
cox.lrn <- makeLearner(cl="surv.coxph", id = "coxph", predict.type="response")
dat = filt$fun(vet.task,
  base.methods = c("univariate.model.score", "randomForestSRC_importance"),
  nselect = 5,
  more.args = list("univariate.model.score" = list(perf.learner = cox.lrn))
)
ordered_dat = dat[with(dat, order(method, -value)), ]
top_features1 = ordered_dat[ordered_dat$method == "E-mean", ]$name[1:5]
# Get the 5 top-ranked features generated by ensemble filter E-mean,
# by calling filterFeatures with filter E-mean
set.seed(24601)
new.task = filterFeatures(vet.task,
  method = "E-mean",
  abs = 5,
  base.methods = c("univariate.model.score", "randomForestSRC_importance"))
top_features2 = getTaskFeatureNames(new.task)
expect_equal(sort(top_features1), sort(top_features2))
#> Error: sort(top_features1) not equal to sort(top_features2).
#> 3/5 mismatches
#> x[1]: "age"
#> y[1]: "celltype.adeno"
#>
#> x[2]: "celltype.adeno"
#> y[2]: "celltype.large"
#>
#> x[5]: "prior"
#> y[5]: "karno"
Created on 2019-11-25 by the reprex package (v0.3.0)
Several issues prevent ensemble filters from selecting the top-ranked features:
1. Each ensemble filter calls calcBaseFilters, which collects the values produced by the base filters and ranks the features by those values. However, the method used to rank the features in this function does not work, as the following example shows (order() returns the positions of the sorted elements, not the rank of each element):
library(testthat)
toy.data = data.frame(
  name = rep(c("a", "b", "c", "d", "e"), 2),
  method = c(rep("m1", 5), rep("m2", 5)),
  value = rep(c(0.3, 0.5, 0.1, 0.2, 0.4), 2)
)
toy.data
#> name method value
#> 1 a m1 0.3
#> 2 b m1 0.5
#> 3 c m1 0.1
#> 4 d m1 0.2
#> 5 e m1 0.4
#> 6 a m2 0.3
#> 7 b m2 0.5
#> 8 c m2 0.1
#> 9 d m2 0.2
#> 10 e m2 0.4
toy.data.ranked = transform(toy.data,
  rank = ave(1:nrow(toy.data), method,
    FUN = function(x) order(toy.data$value[x])))
toy.data.ranked
#> name method value rank
#> 1 a m1 0.3 3
#> 2 b m1 0.5 4
#> 3 c m1 0.1 1
#> 4 d m1 0.2 5
#> 5 e m1 0.4 2
#> 6 a m2 0.3 3
#> 7 b m2 0.5 4
#> 8 c m2 0.1 1
#> 9 d m2 0.2 5
#> 10 e m2 0.4 2
expect_equal(toy.data.ranked$rank, rep(c(3, 1, 5, 4, 2), 2))
#> Error: toy.data.ranked$rank not equal to rep(c(3, 1, 5, 4, 2), 2).
#> 6/10 mismatches (average diff: 2.67)
#> [2] 4 - 1 == 3
#> [3] 1 - 5 == -4
#> [4] 5 - 4 == 1
#> [7] 4 - 1 == 3
#> [8] 1 - 5 == -4
#> [9] 5 - 4 == 1
Created on 2019-11-25 by the reprex package (v0.3.0)
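The mismatch above comes down to the difference between order() and rank(). The following is a minimal sketch of a ranking that reproduces the expected vector from the test, assuming the intended ranking gives rank 1 to the largest value (that is what the expected vector implies). It is an illustration on the toy data only, not necessarily how calcBaseFilters should be patched.

```r
# Toy data copied from the example above.
toy.data <- data.frame(
  name = rep(c("a", "b", "c", "d", "e"), 2),
  method = c(rep("m1", 5), rep("m2", 5)),
  value = rep(c(0.3, 0.5, 0.1, 0.2, 0.4), 2)
)

# order() returns the indices that would sort the vector;
# rank() returns the position of each element in the sorted vector.
m1.value <- toy.data$value[toy.data$method == "m1"]
order(m1.value)
#> [1] 3 4 1 5 2
rank(-m1.value)
#> [1] 3 1 5 4 2

# Rank within each method. Negating the values gives the largest value rank 1
# (an assumption taken from the expected vector in the test above).
toy.data$rank <- ave(toy.data$value, toy.data$method,
  FUN = function(x) rank(-x, ties.method = "first"))
```

With this change, toy.data$rank equals rep(c(3, 1, 5, 4, 2), 2), the vector the test above expects.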
2. In filterFeatures.R, the method to use for the final ranking must be selected. The line below is meant to do this, but it has no effect because it compares the method column with itself:
fval = fval[fval$method == fval$method, ]
As a result, fval still contains the data for all methods, not just the ensemble method.
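To illustrate why the subsetting is a no-op, here is a small sketch on made-up data. The data frame contents and the variable name ensemble.method are hypothetical and chosen for illustration; they are not taken from filterFeatures.R.

```r
# Made-up filter values: two base methods plus one ensemble method.
fval <- data.frame(
  name = rep(c("a", "b"), 3),
  method = rep(c("m1", "m2", "E-mean"), each = 2),
  value = c(0.3, 0.5, 0.1, 0.2, 2, 1),
  stringsAsFactors = FALSE
)

# Comparing the column with itself is TRUE for every row, so nothing is dropped.
unchanged <- fval[fval$method == fval$method, ]
nrow(unchanged) == nrow(fval)
#> [1] TRUE

# Comparing against the name of the selected method keeps only its rows.
# `ensemble.method` is a hypothetical variable holding that name.
ensemble.method <- "E-mean"
fval.ens <- fval[fval$method == ensemble.method, ]
```

Here fval.ens has one row per feature, which is what the later selection steps need.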
3. When the top features are selected by setting a threshold, nselect must be computed after the subsetting described in point 2. Otherwise, nselect can be greater than the total number of features.
if (select == "threshold") {
  nselect = sum(fval[["value"]] >= threshold, na.rm = TRUE)
}
If fval contains data for all methods, there will be multiple rows for each feature, one per method. The calculation above can therefore count the same feature several times, so nselect can be greater than the number of features.
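The over-counting can be reproduced on a small made-up example. The values and the threshold below are arbitrary and only serve to show the effect; they are not output from any real filter.

```r
# Three features scored by two base methods and one ensemble method
# (all values are invented for illustration).
fval <- data.frame(
  name = rep(c("a", "b", "c"), 3),
  method = rep(c("m1", "m2", "E-mean"), each = 3),
  value = c(0.3, 0.5, 0.1, 0.4, 0.6, 0.2, 0.35, 0.55, 0.15),
  stringsAsFactors = FALSE
)
threshold <- 0.3
n.features <- length(unique(fval$name))

# Counting over all methods: each feature can be counted once per method.
nselect.all <- sum(fval[["value"]] >= threshold, na.rm = TRUE)

# Counting after keeping only the ensemble method's rows.
fval.ens <- fval[fval$method == "E-mean", ]
nselect.ens <- sum(fval.ens[["value"]] >= threshold, na.rm = TRUE)

c(n.features = n.features, nselect.all = nselect.all, nselect.ens = nselect.ens)
```

In this sketch nselect.all is 6, which is more than the 3 available features, while nselect.ens is 2.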