Ensemble filters do not select the top ranked features #2685

Closed
@annette987

The code below shows two methods of getting the top-ranked features from the ensemble filter E-mean. Both should return the same result when the base filters are the same, but they do not.

library(mlr)
#> Loading required package: ParamHelpers
library(survival)
library(testthat)
library(checkmate)

data(veteran)
vet.task <- makeSurvTask(id = "VET", data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)

# Below are two methods for getting the top-ranked features from E-mean
# Both should return the same result, if the base filters are the same

# Get the 5 top-ranked features generated by ensemble filter E-mean, 
# by calling the function for that filter directly
set.seed(24601)
filt = mlr:::.FilterEnsembleRegister[["E-mean"]]
cox.lrn <- makeLearner(cl="surv.coxph", id = "coxph", predict.type="response")
dat = filt$fun(vet.task, 
               base.methods = c("univariate.model.score", "randomForestSRC_importance"),
               nselect = 5,
               more.args = list("univariate.model.score"=list(perf.learner=cox.lrn))
)
ordered_dat = dat[with(dat, order(method, -value)), ]
top_features1 = ordered_dat[ordered_dat$method == "E-mean", ]$name[1:5]

# Get the 5 top-ranked features generated by ensemble filter E-mean, 
# by calling filterFeatures with filter E-mean
set.seed(24601)
new.task = filterFeatures(vet.task,
                          method="E-mean",
                          abs = 5,
                          base.methods = c("univariate.model.score", "randomForestSRC_importance"))
top_features2 = getTaskFeatureNames(new.task)
expect_equal(sort(top_features1), sort(top_features2))
#> Error: sort(top_features1) not equal to sort(top_features2).
#> 3/5 mismatches
#> x[1]: "age"
#> y[1]: "celltype.adeno"
#> 
#> x[2]: "celltype.adeno"
#> y[2]: "celltype.large"
#> 
#> x[5]: "prior"
#> y[5]: "karno"

Created on 2019-11-25 by the reprex package (v0.3.0)

There are several issues that are preventing ensemble filters from selecting the top-ranked features.

  1. Each ensemble filter calls calcBaseFilters, which gets the values produced by the base filters and ranks the features by those values. However, the method used to rank the features in this function does not produce ranks. The toy example below shows the problem: order() returns the permutation that sorts the values, not the rank of each value.
library(testthat)

toy.data = data.frame(
  name = rep(c("a", "b", "c", "d", "e"), 2),
  method = c(rep("m1", 5), rep("m2", 5)),
  value = rep(c(0.3, 0.5, 0.1, 0.2, 0.4), 2)
)
toy.data
#>    name method value
#> 1     a     m1   0.3
#> 2     b     m1   0.5
#> 3     c     m1   0.1
#> 4     d     m1   0.2
#> 5     e     m1   0.4
#> 6     a     m2   0.3
#> 7     b     m2   0.5
#> 8     c     m2   0.1
#> 9     d     m2   0.2
#> 10    e     m2   0.4

toy.data.ranked = transform(toy.data,
                            rank = ave(1:nrow(toy.data), method,
                                       FUN = function(x) order(toy.data$value[x])))
toy.data.ranked
#>    name method value rank
#> 1     a     m1   0.3    3
#> 2     b     m1   0.5    4
#> 3     c     m1   0.1    1
#> 4     d     m1   0.2    5
#> 5     e     m1   0.4    2
#> 6     a     m2   0.3    3
#> 7     b     m2   0.5    4
#> 8     c     m2   0.1    1
#> 9     d     m2   0.2    5
#> 10    e     m2   0.4    2
expect_equal(toy.data.ranked$rank, rep(c(3, 1, 5, 4, 2), 2))
#> Error: toy.data.ranked$rank not equal to rep(c(3, 1, 5, 4, 2), 2).
#> 6/10 mismatches (average diff: 2.67)
#> [2] 4 - 1 ==  3
#> [3] 1 - 5 == -4
#> [4] 5 - 4 ==  1
#> [7] 4 - 1 ==  3
#> [8] 1 - 5 == -4
#> [9] 5 - 4 ==  1

Created on 2019-11-25 by the reprex package (v0.3.0)
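
One way to get per-method ranks is to use rank() instead of order() inside the grouping call. The sketch below rebuilds the toy data from the reprex above and assumes that rank 1 should go to the largest value, which matches the expectation in the failing test. It only illustrates the idea and is not the patch applied to calcBaseFilters.

```r
# Rebuild the toy data from the reprex above.
toy.data = data.frame(
  name = rep(c("a", "b", "c", "d", "e"), 2),
  method = c(rep("m1", 5), rep("m2", 5)),
  value = rep(c(0.3, 0.5, 0.1, 0.2, 0.4), 2)
)

# order() returns the permutation that sorts its input, not the rank of each
# element. rank() returns the position of each element in the sorted vector.
# Assumption: rank 1 denotes the largest value, hence the negation.
toy.data$rank = ave(toy.data$value, toy.data$method,
                    FUN = function(v) rank(-v, ties.method = "min"))

stopifnot(isTRUE(all.equal(toy.data$rank, rep(c(3, 1, 5, 4, 2), 2))))
```

Passing the values to ave() directly, rather than row indices, also removes the need to index back into the data frame inside FUN.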

  2. In filterFeatures.R, the rows belonging to the method used for the final ranking must be selected. The line below is meant to do this, but it has no effect because it compares the column with itself:
    fval = fval[fval$method == fval$method, ]
    As a result, fval still contains the data for all methods, not just the ensemble method.
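
The no-op is easy to reproduce on a small data frame. In this sketch the values are made up and `ens.method` is a hypothetical stand-in for whatever holds the name of the ensemble method inside filterFeatures. The point is only that the comparison needs a value on the right-hand side rather than the column itself.

```r
# Toy filter values: two base methods plus the ensemble method.
fval = data.frame(
  name = rep(c("a", "b", "c"), 3),
  method = rep(c("m1", "m2", "E-mean"), each = 3),
  value = c(0.3, 0.5, 0.1, 0.2, 0.4, 0.6, 0.25, 0.45, 0.35),
  stringsAsFactors = FALSE
)

# Comparing the column with itself is TRUE for every row, so nothing is dropped.
unchanged = fval[fval$method == fval$method, ]
stopifnot(nrow(unchanged) == nrow(fval))

# Hypothetical variable holding the ensemble method's name.
ens.method = "E-mean"
subsetted = fval[fval$method == ens.method, ]
stopifnot(nrow(subsetted) == 3, all(subsetted$method == "E-mean"))
```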

  3. When the top features are selected by setting a threshold, this must be done after the subsetting referred to in point 2. If not, nselect can be greater than the total number of features.

  if (select == "threshold") {
    nselect = sum(fval[["value"]] >= threshold, na.rm = TRUE)
  }

If fval contains data for all methods, there will be multiple rows for each feature - one per method. So the calculation above can include the same feature multiple times and thus nselect could be greater than the number of features.
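
A small illustration of the overcount, using made-up values and a made-up threshold: with three features and three methods, counting over all rows yields more selected features than exist, while counting after restricting to the ensemble rows cannot.

```r
# Toy filter values: three features, two base methods plus the ensemble method.
fval = data.frame(
  name = rep(c("a", "b", "c"), 3),
  method = rep(c("m1", "m2", "E-mean"), each = 3),
  value = c(0.3, 0.5, 0.1, 0.2, 0.4, 0.6, 0.25, 0.45, 0.35),
  stringsAsFactors = FALSE
)
threshold = 0.15  # arbitrary value chosen for the illustration
n.features = length(unique(fval$name))

# Counting over every method's rows counts each feature up to three times.
nselect.all = sum(fval[["value"]] >= threshold, na.rm = TRUE)

# Counting only the ensemble rows is bounded by the number of features.
ens = fval[fval$method == "E-mean", ]
nselect.ens = sum(ens[["value"]] >= threshold, na.rm = TRUE)

stopifnot(nselect.all > n.features, nselect.ens <= n.features)
```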
