Filter: Add ensemble filter methods #2456

pat-s · 2018-10-19T16:54:25Z

Purpose

In the filtering/feature selection field, ensemble filter methods are becoming increasingly popular. Ensemble filters aggregate the rankings of multiple single filters into a new ranking.

This approach has been shown to outperform single filter methods, see e.g. https://ieeexplore.ieee.org/document/8250495.

Implementation

I decided to establish a new class "FilterEnsemble" which distinguishes the ensemble filters from the normal single filters. This decision has (as always) positive and negative side-effects.

Ensemble filters are created in the same way as normal filters in their own file, R/FilterEnsemble.R.
They share the same class structure with some minor differences:

no pkg, supported.tasks, supported.features arguments (all checked by the simple filters)
new argument base.methods which stands for "single filter methods"

Calculation is done as usual via generateFilterValuesData() or filterFeatures():

First, a FilterValues object is created as usual by calling generateFilterValuesData() with the single filters.
Then, the specific ensemble filter calculations are done on the FilterValues object (e.g. taking the mean across all voters for each feature).
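As a rough illustration of the aggregation step, the sketch below ranks the scores of each base filter and then combines the ranks per feature. It is plain base R and not the code from this PR: the helper aggregate_ranks() is a hypothetical name, and the scores are copied from the reprex output in the Examples section. Ranking in ascending order (1 = least important) is an assumption; it agrees with the E-mean values printed there.

```r
# Filter scores copied from the reprex output in the Examples section.
scores = data.frame(
  name = c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"),
  gain.ratio = c(0.871, 0.858, 0.420, 0.247),
  information.gain = c(0.955, 0.940, 0.452, 0.267),
  stringsAsFactors = FALSE
)

# Rank every base filter separately (assumption: 1 = lowest score).
ranks = sapply(scores[, c("gain.ratio", "information.gain")], rank)

# Hypothetical helper: aggregate the per-filter ranks feature by feature.
aggregate_ranks = function(ranks, fun) apply(ranks, 1, fun)

e.mean = setNames(aggregate_ranks(ranks, mean), scores$name)
e.min = setNames(aggregate_ranks(ranks, min), scores$name)
print(e.mean)
print(e.min)
```

Because both base filters order the iris features identically here, every aggregation function yields the same ranking; the variants only differ once the base filters disagree.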

Notation

Notation differs a bit among the functions.
In generateFilterValuesData(), an ensemble method is passed in a list together with its required simple methods, e.g.:

generateFilterValuesData(iris.task, 
  method = list("E-min", c('gain.ratio','information.gain')))

To keep makeFilterWrapper() flexible, so that the single methods used by an ensemble method can be tuned, a new argument base.methods was introduced. It only takes effect if an ensemble method is set in either filterFeatures(method = "") or makeFilterWrapper(fw.method = "").

makeFilterWrapper(lrn, fw.method = "E-min", 
  fw.base.methods = c("gain.ratio", "information.gain"))

filterFeatures(iris.task, method = "E-min", 
  base.methods = c("gain.ratio", "information.gain"), abs = 2)

This gives the user the option to tune

over multiple filters set in fw.method
over multiple single filters if an ensemble filter is within fw.method

Tuning simple filters is currently not supported due to the lack of sampling without replacement for DiscreteVectorParams in ParamHelpers (mlr-org/ParamHelpers#206).

As multiple rankings are calculated and returned when using an ensemble filter, filterFeatures() will always prioritize the ensemble method unless a different method is set via the new select.method argument.
This only applies if filterFeatures() is called directly, since in the wrapper only one filter method is used for subsetting anyway (and in the ensemble case, the ensemble method is prioritized).

Other changes

getFilterValuesData() now returns a tbl instead of a data.frame (I think there is no reason not to use enhanced data.frame output. It does not harm any internal processes.)
plotFilterValues() got a bit "smarter" and easier now regarding the ordering of multiple facets
I added multiple examples to the help pages of filterFeatures(), generateFilterValuesData() and makeFilterWrapper()
Instead of a wide data.frame the values are now returned in a long (tidy) data.frame. This makes it easier to apply post-processing methods (like group_by() calls etc)
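To illustrate why the long layout simplifies post-processing, here is a small base R sketch. It does not use mlr: the wide table is rebuilt by hand from the scores shown in the Examples section, and the variable names (wide, long, filter.methods, best) are made up for this example.

```r
# Wide layout: one column per filter method (values from the Examples section).
wide = data.frame(
  name = c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"),
  gain.ratio = c(0.871, 0.858, 0.420, 0.247),
  information.gain = c(0.955, 0.940, 0.452, 0.267),
  stringsAsFactors = FALSE
)

# Long (tidy) layout: one row per feature/method combination.
filter.methods = c("gain.ratio", "information.gain")
long = data.frame(
  name = rep(wide$name, times = length(filter.methods)),
  method = rep(filter.methods, each = nrow(wide)),
  value = unlist(wide[filter.methods], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Grouped post-processing becomes a split/apply: best feature per method.
best = sapply(split(long, long$method), function(d) d$name[which.max(d$value)])
print(best)
```

In the long layout, adding another filter method only adds rows, so grouped operations like the one above need no changes.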

To-do

tests
Cache filterValues in a tuning process (and don't recalculate them all the time), see "Tuning with filters recalculates filter values each iteration" #1995

Examples

library(mlr)
#> Loading required package: ParamHelpers
#> Warning: replacing previous import 'stats::filter' by 'dplyr::filter' when
#> loading 'mlr'
fval = generateFilterValuesData(iris.task, 
  method = list("E-mean", c("gain.ratio", "information.gain")))
fval
#> FilterValues:
#> Task: iris-example
#> # A tibble: 12 x 4
#>    name         type    method           value
#>    <chr>        <chr>   <chr>            <dbl>
#>  1 Petal.Width  numeric E-mean           4    
#>  2 Petal.Length numeric E-mean           3    
#>  3 Sepal.Length numeric E-mean           2    
#>  4 Sepal.Width  numeric E-mean           1    
#>  5 Petal.Width  numeric gain.ratio       0.871
#>  6 Petal.Length numeric gain.ratio       0.858
#>  7 Sepal.Length numeric gain.ratio       0.420
#>  8 Sepal.Width  numeric gain.ratio       0.247
#>  9 Petal.Width  numeric information.gain 0.955
#> 10 Petal.Length numeric information.gain 0.940
#> 11 Sepal.Length numeric information.gain 0.452
#> 12 Sepal.Width  numeric information.gain 0.267

filterFeatures(iris.task, method = "E-min", 
  base.methods = c("gain.ratio", "information.gain"), abs = 2)
#> Supervised task: iris-example
#> Type: classif
#> Target: Species
#> Observations: 150
#> Features:
#>    numerics     factors     ordered functionals 
#>           2           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: FALSE
#> Classes: 3
#>     setosa versicolor  virginica 
#>         50         50         50 
#> Positive class: NA


### makeFilterWrapper(), can ofc also be used within tuneParams()
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.lda")
inner = makeResampleDesc("Holdout")
outer = makeResampleDesc("CV", iters = 2)

# usage of an ensemble filter
lrn = makeFilterWrapper(makeLearner("classif.lda"), fw.method = "E-Borda",
  fw.base.methods = c("gain.ratio", "information.gain"),
  fw.perc = 0.5)
r = resample(lrn, task, outer, extract = function(model) {
  getFilteredFeatures(model)
})
#> Resampling: cross-validation
#> Measures:             mmce
#> [Resample] iter 1:    0.0533333
#> [Resample] iter 2:    0.0533333
#> 
#> Aggregated Result: mmce.test.mean=0.0533333
#> 
print(r$extract)
#> [[1]]
#> [1] "Petal.Length" "Petal.Width" 
#> 
#> [[2]]
#> [1] "Petal.Length" "Petal.Width"

plotFilterValues(fval)

Created on 2018-10-22 by the reprex package (v0.2.1)

larskotthoff · 2018-10-19T19:56:07Z

R/FilterWrapper.R

@@ -26,12 +26,18 @@
 #'   Mutually exclusive with arguments `fw.perc` and `fw.abs`.
 #' @param fw.mandatory.feat ([character])\cr
 #'   Mandatory features which are always included regardless of their scores
+#' @param ensemble.method ([character])\cr
+#'   Which ensemble method should be used. Can only be used with >= 2 filter methods.


How exactly does this work? You can only specify one method in the wrapper, can't you?

Also why is this a character? Comments and code below suggest that this is a logical value.

No, you can use multiple methods.

> Also why is this a character? Comments and code below suggest that this is a logical value.

I'll check again. But as said, you can use multiple ones.

larskotthoff · 2018-10-19T20:00:36Z

R/generateFilterValues.R

 #' @template arg_task
 #' @param method ([character])\cr
 #'   Filter method(s), see above.
 #'   Default is \dQuote{randomForestSRC.rfsrc}.
 #' @param nselect (`integer(1)`)\cr
 #'   Number of scores to request. Scores are getting calculated for all features per default.
+#' @param ensemble.method ([character])\cr
+#'   Ensemble filter method to use. Can only be used with >= 2 filter methods.


Should be consistent with wrapper -- character or logical?

Nothing is finished yet :)

larskotthoff · 2018-10-19T20:01:28Z

R/generateFilterValues.R

+
+    ### ensemble rank aggregation
+
+    if (any(c("E-min", "E-mean", "E-median", "E-max", "E-Borda") %in% ensemble.method)) {


Possible values for ensemble method should be documented.

Yep, I will ofc do this.

larskotthoff · 2018-10-19T20:10:39Z

tests/testthat/test_base_generateFilterValuesData.R

+
+test_that("ensemble methods work", {
+  fi = generateFilterValuesData(multiclass.task, method = c('gain.ratio','information.gain'),
+                                ensemble.method = c("E-Borda", "E-min"))


What does it mean if multiple ensemble methods are specified?

The same as if you used multiple single ones. You get back a DF with all listed rankings when you use generateFilterValuesData().

mb706 · 2018-10-20T12:17:01Z

Instead of putting lots of special code into the generateFilterValuesData function, the ensemble stuff should probably happen somewhere else. My suggestion is that the filter code should be changed to also accept functions or Filter objects (i.e. the objects found in mlr:::.FilterRegister). Ensembles (and other interesting things) could then be implemented using functionals that create new filters from existing ones:

filterFeatures(pid.task, "univariate.model.score", abs = 3,
  perf.learner = "classif.logreg")
# (there should probably be a better way to access this than by ":::")
filterFeatures(pid.task, mlr:::.FilterRegister$univariate.model.score, abs = 3,
  perf.learner = "classif.logreg")
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
# alternative:
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  filter.args = list(univariate.model.score = list(perf.learn = "classif.logreg")),
  abs = 3)

In these examples, the makeFilterEnsemble method would return a Filter object (mostly a function with some metadata about allowed task types) that does the ensemble things internally; "generateFilterValuesData" should not be involved in this and call the metafilter just the same way it calls an ordinary filter.
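The functional idea can be sketched in base R as follows. This is not mlr code: make_ensemble_filter(), filter_variance() and filter_range() are hypothetical names invented for this illustration, and the toy filters simply score numeric columns without looking at a target.

```r
# Toy base filters (illustrative stand-ins, not mlr filters): each maps a
# data.frame of numeric features to one named score per feature.
filter_variance = function(data) sapply(data, var)
filter_range = function(data) sapply(data, function(x) diff(range(x)))

# Functional: takes an aggregation function and a named list of base filters
# and returns a new filter with the same calling convention as the base ones.
make_ensemble_filter = function(aggregate, base.filters) {
  function(data) {
    # One column of ranks per base filter, one row per feature.
    ranks = sapply(base.filters, function(f) rank(f(data)))
    apply(ranks, 1, aggregate)
  }
}

ens = make_ensemble_filter(median,
  list(variance = filter_variance, range = filter_range))
ens.scores = ens(iris[, 1:4])
print(sort(ens.scores, decreasing = TRUE))
```

Since the ensemble is itself just a filter function, calling code would not need to know whether it received a single filter or a combination of several.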

pat-s · 2018-10-20T20:39:51Z

Hi guys,

This is all WIP here. The main idea is to use them in makeFilterWrapper(). I'll come back to your comments later.

pat-s · 2018-10-21T08:28:33Z

@mb706 Thanks for your input.

Sounds like a good idea. I prefer the following notation:

filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")

pat-s · 2018-10-21T22:22:47Z

new class FilterEnsemble
new listFilterEnsembleMethods() etc
wrapper working
filterFeatures() and generateFilterValuesData() working

Doc, tests and more concrete examples in the next days.

pat-s · 2018-10-22T11:16:53Z

@larskotthoff @mb706
Looking forward to your comments now - see first post.

larskotthoff · 2018-10-22T17:06:32Z

Looks like builds are failing...

pat-s · 2018-10-22T17:44:06Z

Tests etc are still missing. It's more about the general approach before fixing all the details and then changing everything again.

Would be great if we could talk about the "big picture" 🙂

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547668906 Commit: 23bc3cd

jakob-r · 2019-06-19T13:01:37Z

NEWS.md

@@ -1,5 +1,9 @@
 # mlr 2.14.0.9000

+## Breaking
+
+- Instead of a wide `data.frame` filter values are now returned in a long (tidy) `tibble`. This makes it easier to apply post-processing methods (like `group_by()`, etc) (@pat-s, #2456)


I could imagine that this is actually something more people programmed against. So a breaking change here might irritate some people. Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.
Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

> I could imagine that this is actually something more people programmed against.

This refers to returning a long DF?
Changing this again would involve several hours since I have to re-arrange all outputs..

> Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

> Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.

Yes, sure. For now it is only used for printing, not internally (i.e. the DF is coerced right before it's returned). Which ofc makes no difference for the import of tibble. I just hate it to print a DF that fills my console to Inf...

> This refers to returning a long DF?

Yes

> Changing this again would involve several hours since I have to re-arrange all outputs..

Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.

> Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

Can you point me to it?

> I just hate it to print a DF that fills my console to Inf...

Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.

if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
  print.data.frame = function(x, ...) {
    tibble:::print.tbl(tibble::as_tibble(x), ...)
  }
}

> Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.

It will break code since the structure of the returned filter value is different, yes.

> It is something that could make a reverse dependency check worth it before going on CRAN.

Yes, I always do that.

> Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.

Nice hack. I'll use it :) - and get rid of using tibble then in the package.

> Can you point me to it?

mlr/R/generateFilterValues.R

Line 142 in edf0142

out = tidyr::gather(out, method, "value", !!dplyr::enquo(method))

I tried a lot of non-dplyr stuff here but eventually gave up.

I updated it using melt from data.table

🚀
so easy, damn...

pat-s · 2019-06-22T07:54:58Z

I assume that the ensemble filters are not using caching since I see long runtimes for them in my study. The aggregation step cannot cause this so I assume the simple filters are not being taken from the cache. Have to inspect.

pat-s · 2019-06-23T08:58:40Z

I made a mistake during merging (most likely) - caching was not used so far in this PR because the memoized function was not used. See 049969f. Fixed it now.

I was wondering heavily why everything took so long in my project.. 🙄 🤦‍♂️
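For readers unfamiliar with the idea, the effect of a memoized filter function can be sketched in base R like this. The caching code of this PR is not shown in the thread, so everything below is an assumption for illustration only: memoize(), slow_filter() and the call counter are hypothetical names.

```r
# Hypothetical memoizer: caches results of f() per character key.
memoize = function(f) {
  cache = new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Counter to show how often the expensive function really runs.
counter = new.env()
counter$n = 0

# Stand-in for an expensive base filter calculation on the iris features.
slow_filter = function(method) {
  counter$n = counter$n + 1
  switch(method,
    variance = sapply(iris[, 1:4], var),
    range = sapply(iris[, 1:4], function(x) diff(range(x))),
    stop("unknown method: ", method))
}

cached_filter = memoize(slow_filter)
first = cached_filter("variance")   # computed
second = cached_filter("variance")  # served from the cache
print(counter$n)
```

An ensemble that calls the plain slow_filter() instead of cached_filter() recomputes every base filter on each call, which matches the long runtimes described above.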

Build URL: https://travis-ci.org/mlr-org/mlr/builds/549827002 Commit: 4d4dca6

pat-s · 2019-06-24T19:52:09Z

@larskotthoff @jakob-r
I guess we're good for now here. If I encounter more issues along the way I'll fix them separately.

@jakob-r If you approve your review, feel free to merge.

jakob-r · 2019-06-25T07:38:45Z

DESCRIPTION

@@ -207,6 +208,7 @@ Suggests:
    LiblineaR,
    lintr (>= 1.0.0.9001),
    MASS,
+    magrittr,


Not needed.

jakob-r · 2019-06-25T07:40:14Z

DESCRIPTION

@@ -147,6 +147,7 @@ Imports:
    ggplot2,
    methods,
    parallelMap (>= 1.3),
+    rlang,


where is .data used?

Leftover 👍 See commit.

Build URL: https://travis-ci.org/mlr-org/mlr/builds/551632468 Commit: e63d863

pat-s · 2019-07-01T14:52:00Z

@jakob-r Mergeable now?

Build URL: https://travis-ci.org/mlr-org/mlr/builds/560455073 Commit: 345da05

pat-s · 2019-07-19T13:46:27Z

Merging now.

* wip * save state * fs ensemble working * add fs ens test * update plotFilterValues() * new approach * add example to filterFeatures * tuneParams() adjustments * tests for generateFilterValuesData() and makefilterWrapper() * allow only passing basal.methods * add tuneParams() checks for ensemble filters * import dplyr and magrittr * no cache * suggests: tidyr * solve dplyr::filter NS clash * trying to fix "export of global variables" note * example indentation * clean example * fix docs * fix global variables export error, add param doc * update generateFilterValuesData tests * indent * account for nselect = 0 * remove getFilterValues test * try using SE in tidyr::gather * fix global variables warning * next SE attempt * update tests * update test * update filters * fix naming * fix tests * more examples * adjust NS * fix tic.R * update test_FilterWrapper * don't set length of fw.basal.methods * add benchmark test * fix filter names * update tests * get tutorial passing * solve NS clash * style * NS and man * remove purrr dep * remove dplyr, purrr and magrittr imports * fix tests * remove browser() leftover * fix coercin > 1 error * fix brackets * Deploy from Travis build 13854 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/540057376 Commit: 7b72274 * basal.methods -> base.methods * Deploy from Travis build 13871 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/541796595 Commit: 3fad99c * move redundant code into helper function * reduce simple filters in benchmark test * style * we cannot tune the simple methods currently * don't allow list specification of ensemble filters through `methods` argument * document list notation for method arg in `generateFilterValuesData()` * Deploy from Travis build 13907 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/544919859 Commit: 92b4c5f * support all task and feature types for ensemble filters * define ens.method in the function body * update filter table * Deploy from Travis build 
13934 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547161246 Commit: f23070f * fix plotFilterValues() * Deploy from Travis build 13935 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547260271 Commit: 1f547fa * Deploy from Travis build 13938 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547612585 Commit: e4258f6 * revert unwanted change * add NEWS * Deploy from Travis build 13946 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547668906 Commit: 23bc3cd * remove tibble * use data.table::melt * fix data.table::melt * fix caching * fix R CMD check notes, remove unused argument from makeFilterEnsemble() * Deploy from Travis build 13971 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/549827002 Commit: 4d4dca6 * rlang and magrittr not used anymore * fix NS * add info how to pass filter args when using ensemble filters * Deploy from Travis build 13993 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/551632468 Commit: e63d863 * Deploy from Travis build 14023 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/560455073 Commit: 345da05

pat-s added 5 commits October 12, 2018 18:28

wip

e89804f

save state

7b8e8f7

fs ensemble working

00ff999

add fs ens test

53c2c72

update plotFilterValues()

6ba5407

pat-s added type-enhancement pr-work in progress - not done project - base labels Oct 19, 2018

Merge branch 'master' into fs-ensemble

422310e

larskotthoff requested changes Oct 19, 2018


new approach

95ac582

add example to filterFeatures

eaa81da

pat-s added 11 commits October 22, 2018 22:19

Merge branch 'master' into fs-ensemble

e5d5a80

Merge branch 'fs-ensemble' of github.com:mlr-org/mlr into fs-ensemble

0b9e5c4

tuneParams() adjustments

496a91c

tests for generateFilterValuesData() and makefilterWrapper()

3901658

allow only passing basal.methods

caf5aad

add tuneParams() checks for ensemble filters

cd8306e

Merge branch 'master' into fs-ensemble

1745748

import dplyr and magrittr

1c8325d

no cache

e7b6c55

suggests: tidyr

3296a5b

solve dplyr::filter NS clash

0905b39

pat-s and others added 2 commits June 19, 2019 13:29

Merge branch 'master' into fs-ensemble

23bc3cd

Deploy from Travis build 13946 [ci skip]

88b5954

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547668906 Commit: 23bc3cd

jakob-r reviewed Jun 19, 2019


pat-s and others added 3 commits June 21, 2019 13:23

remove tibble

edf0142

use data.table::melt

34813a0

fix data.table::melt

7e83abb

fix caching

049969f

pat-s and others added 3 commits June 23, 2019 20:12

fix R CMD check notes, remove unused argument from makeFilterEnsemble()

cd37f00

Merge branch 'master' into fs-ensemble

4d4dca6

Deploy from Travis build 13971 [ci skip]

0514671

Build URL: https://travis-ci.org/mlr-org/mlr/builds/549827002 Commit: 4d4dca6

jakob-r requested changes Jun 25, 2019


pat-s and others added 8 commits June 25, 2019 10:16

rlang and magrittr not used anymore

d89a19b

Merge branch 'fs-ensemble' of github.com:mlr-org/mlr into fs-ensemble

234f855

fix NS

94d95b8

Merge branch 'master' into fs-ensemble

8dafb80

add info how to pass filter args when using ensemble filters

d257f9f

Merge branch 'master' into fs-ensemble

e63d863

Deploy from Travis build 13993 [ci skip]

17a2ca9

Build URL: https://travis-ci.org/mlr-org/mlr/builds/551632468 Commit: e63d863

Merge branch 'master' into fs-ensemble

555ff2e

pat-s and others added 3 commits July 18, 2019 14:10

Merge branch 'master' into fs-ensemble

2e7f5fd

Merge branch 'master' into fs-ensemble

345da05

Deploy from Travis build 14023 [ci skip]

f110f3e

Build URL: https://travis-ci.org/mlr-org/mlr/builds/560455073 Commit: 345da05

pat-s merged commit 3092400 into master Jul 19, 2019

pat-s deleted the fs-ensemble branch July 19, 2019 13:46

