Filter: Add ensemble filter methods #2456

Merged
merged 115 commits into from
Jul 19, 2019
pat-s
Member

@pat-s pat-s commented Oct 19, 2018

Purpose

In the filtering/feature selection field, ensemble filter methods are becoming increasingly popular. Ensemble filters aggregate the rankings of multiple single filters and create a new ranking.

This approach has been shown to outperform single filter methods, e.g. https://ieeexplore.ieee.org/document/8250495.

Implementation

I decided to establish a new class "FilterEnsemble" which distinguishes the ensemble filters from the single filters. This decision has (as always) positive and negative side-effects.

Ensemble filters are created in the same way as normal filters in their own file, R/FilterEnsemble.R.
They share the same class structure with some minor differences:

  • no pkg, supported.tasks, supported.features arguments (all checked by the simple filters)
  • new argument base.methods which stands for "single filter methods"

Calculation is done as usual via generateFilterValuesData() or filterFeatures():

  1. First, a FilterValues object is created as usual by calling generateFilterValuesData() with the single filters.
  2. Then, the specific ensemble filter calculations are done on the FilterValues object (e.g. taking the mean across all voters for each feature).
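The two steps above can be sketched in base R. This is only an illustration, not the code in R/FilterEnsemble.R: it assumes that the aggregation works on per-filter ranks (which is what the E-mean values in the example output of this PR suggest), and the object names `scores`, `ranks`, `e_mean` and `e_min` are made up for the sketch. The scores are the gain.ratio and information.gain values shown for iris.task in the example output.

```r
# Scores of two base filters on the iris features
# (values copied from the example output in this PR).
scores <- data.frame(
  name = c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"),
  gain.ratio = c(0.871, 0.858, 0.420, 0.247),
  information.gain = c(0.955, 0.940, 0.452, 0.267),
  stringsAsFactors = FALSE
)

# Step 1: turn the scores of each base filter into ranks
# (assumption: a higher score gets a higher rank).
ranks <- sapply(scores[, c("gain.ratio", "information.gain")], rank)
rownames(ranks) <- scores$name

# Step 2: aggregate the ranks across the base filters, one value per feature.
e_mean <- apply(ranks, 1, mean) # analogous to "E-mean"
e_min <- apply(ranks, 1, min)   # analogous to "E-min"

print(e_mean)
```

With these inputs both base filters agree on the ordering, so the mean and the minimum of the ranks give the same ranking.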

Notation

Notation differs a bit among the functions.
In generateFilterValuesData(), an ensemble method is passed in a list together with its required simple methods, e.g.:

generateFilterValuesData(iris.task, 
  method = list("E-min", c('gain.ratio','information.gain')))

To keep makeFilterWrapper() flexible, so that the single methods used by an ensemble method can be tuned, a new argument base.methods was introduced. It is only used when an ensemble method is set, either in filterFeatures(method = "") or in makeFilterWrapper(fw.method = "").

makeFilterWrapper(lrn, fw.method = "E-min",
  fw.base.methods = c("gain.ratio", "information.gain"))
filterFeatures(iris.task, method = "E-min",
  base.methods = c("gain.ratio", "information.gain"), abs = 2)

This gives the user the option to tune

  • over multiple filters set in fw.method
  • over multiple single filters if an ensemble filter is within fw.method

Tuning simple filters is currently not supported because ParamHelpers lacks sampling without replacement for DiscreteVectorParams (mlr-org/ParamHelpers#206).

As multiple rankings are calculated and returned when using an ensemble filter, filterFeatures() always prioritizes the ensemble method unless a different method is set via the new select.method argument.
This only applies if filterFeatures() is called directly: in the wrapper, only one filter method is used for subsetting anyway (and in the ensemble case, the ensemble method is prioritized).

Other changes

  • getFilterValuesData() now returns a tbl instead of a data.frame (I think there is no reason not to use enhanced data.frame output. It does not harm any internal processes.)
  • plotFilterValues() is now a bit "smarter" and easier to use regarding the ordering of multiple facets
  • I added multiple examples to the help pages of filterFeatures(), generateFilterValuesData() and makeFilterWrapper()
  • Instead of a wide data.frame, the values are now returned in a long (tidy) data.frame. This makes it easier to apply post-processing methods (like group_by() calls etc.)
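To illustrate the wide-to-long change from the last bullet, here is a small base R sketch. It is not the reshaping code used in this PR; the objects `wide`, `long` and `best` are hypothetical, and `aggregate()` stands in for a `group_by()`-style post-processing step.

```r
# Old layout (wide): one column per filter method.
wide <- data.frame(
  name = c("Petal.Width", "Petal.Length"),
  type = "numeric",
  gain.ratio = c(0.871, 0.858),
  information.gain = c(0.955, 0.940),
  stringsAsFactors = FALSE
)

# New layout (long): one row per feature/method pair.
method_cols <- setdiff(names(wide), c("name", "type"))
long <- do.call(rbind, lapply(method_cols, function(m) {
  data.frame(name = wide$name, type = wide$type, method = m,
    value = wide[[m]], stringsAsFactors = FALSE)
}))

# Grouped post-processing becomes a one-liner on the long layout,
# e.g. the highest score per method.
best <- aggregate(value ~ method, data = long, FUN = max)
print(long)
print(best)
```

In the long layout, adding another filter method only adds rows, so downstream code does not need to know the method names in advance.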

To-do

Examples

library(mlr)
#> Loading required package: ParamHelpers
#> Warning: replacing previous import 'stats::filter' by 'dplyr::filter' when
#> loading 'mlr'
fval = generateFilterValuesData(iris.task, 
  method = list("E-mean", c("gain.ratio", "information.gain")))
fval
#> FilterValues:
#> Task: iris-example
#> # A tibble: 12 x 4
#>    name         type    method           value
#>    <chr>        <chr>   <chr>            <dbl>
#>  1 Petal.Width  numeric E-mean           4    
#>  2 Petal.Length numeric E-mean           3    
#>  3 Sepal.Length numeric E-mean           2    
#>  4 Sepal.Width  numeric E-mean           1    
#>  5 Petal.Width  numeric gain.ratio       0.871
#>  6 Petal.Length numeric gain.ratio       0.858
#>  7 Sepal.Length numeric gain.ratio       0.420
#>  8 Sepal.Width  numeric gain.ratio       0.247
#>  9 Petal.Width  numeric information.gain 0.955
#> 10 Petal.Length numeric information.gain 0.940
#> 11 Sepal.Length numeric information.gain 0.452
#> 12 Sepal.Width  numeric information.gain 0.267

filterFeatures(iris.task, method = "E-min", 
  base.methods = c("gain.ratio", "information.gain"), abs = 2)
#> Supervised task: iris-example
#> Type: classif
#> Target: Species
#> Observations: 150
#> Features:
#>    numerics     factors     ordered functionals 
#>           2           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: FALSE
#> Classes: 3
#>     setosa versicolor  virginica 
#>         50         50         50 
#> Positive class: NA


### makeFilterWrapper(), can ofc also be used within tuneParams()
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.lda")
inner = makeResampleDesc("Holdout")
outer = makeResampleDesc("CV", iters = 2)

# usage of an ensemble filter
lrn = makeFilterWrapper(makeLearner("classif.lda"), fw.method = "E-Borda",
  fw.base.methods = c("gain.ratio", "information.gain"),
  fw.perc = 0.5)
r = resample(lrn, task, outer, extract = function(model) {
  getFilteredFeatures(model)
})
#> Resampling: cross-validation
#> Measures:             mmce
#> [Resample] iter 1:    0.0533333
#> [Resample] iter 2:    0.0533333
#> 
#> Aggregated Result: mmce.test.mean=0.0533333
#> 
print(r$extract)
#> [[1]]
#> [1] "Petal.Length" "Petal.Width" 
#> 
#> [[2]]
#> [1] "Petal.Length" "Petal.Width"

plotFilterValues(fval)

Created on 2018-10-22 by the reprex package (v0.2.1)

@@ -26,12 +26,18 @@
#' Mutually exclusive with arguments `fw.perc` and `fw.abs`.
#' @param fw.mandatory.feat ([character])\cr
#' Mandatory features which are always included regardless of their scores
#' @param ensemble.method ([character])\cr
#' Which ensemble method should be used. Can only be used with >= 2 filter methods.
Member

How exactly does this work? You can only specify one method in the wrapper, can't you?

Member

Also why is this a character? Comments and code below suggest that this is a logical value.

Member Author

No, you can use multiple methods.

> Also why is this a character? Comments and code below suggest that this is a logical value.

I'll check again. But as said, you can use multiple ones.

#' @template arg_task
#' @param method ([character])\cr
#' Filter method(s), see above.
#' Default is \dQuote{randomForestSRC.rfsrc}.
#' @param nselect (`integer(1)`)\cr
#' Number of scores to request. Scores are getting calculated for all features per default.
#' @param ensemble.method ([character])\cr
#' Ensemble filter method to use. Can only be used with >= 2 filter methods.
Member

Should be consistent with wrapper -- character or logical?

Member Author

Nothing is finished yet :)


### ensemble rank aggregation

if (any(c("E-min", "E-mean", "E-median", "E-max", "E-Borda") %in% ensemble.method)) {
Member

Possible values for ensemble method should be documented.

Member Author

Yep, I will ofc do this.


test_that("ensemble methods work", {
fi = generateFilterValuesData(multiclass.task, method = c('gain.ratio','information.gain'),
ensemble.method = c("E-Borda", "E-min"))
Member

What does it mean if multiple ensemble methods are specified?

Member Author

The same as if you would use multiple single ones. You get back a DF with all listed rankings when you use generateFilterValuesData().

@mb706
Contributor

mb706 commented Oct 20, 2018

Instead of putting lots of special code into the generateFilterValuesData function, the ensemble stuff should probably happen somewhere else. My suggestion is that the filter code should be changed to also accept functions or Filter objects (i.e. the objects found in mlr:::.FilterRegister). Ensembles (and other interesting things) could then be implemented using functionals that create new filters from existing ones:

filterFeatures(pid.task, "univariate.model.score", abs = 3,
  perf.learner = "classif.logreg")
# (there should probably be a better way to access this than by ":::")
filterFeatures(pid.task, mlr:::.FilterRegister$univariate.model.score, abs = 3,
  perf.learner = "classif.logreg")
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
# alternative:
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  filter.args = list(univariate.model.score = list(perf.learn = "classif.logreg")),
  abs = 3)

In these examples, the makeFilterEnsemble method would return a Filter object (mostly a function with some metadata about allowed task types) that does the ensemble things internally; "generateFilterValuesData" should not be involved in this and call the metafilter just the same way it calls an ordinary filter.
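A rough base R sketch of this "functional" idea, for illustration only: `make_filter_ensemble()`, `variance_filter()` and `range_filter()` are hypothetical names and toy filters, not mlr functions, and the sketch assumes that a filter is simply a function returning one named score per feature.

```r
# Hypothetical function factory: builds a new filter from existing ones.
# `aggregate_fun` combines the per-filter ranks of one feature (e.g. median).
make_filter_ensemble <- function(aggregate_fun, base_filters) {
  force(aggregate_fun)
  force(base_filters)
  function(data, target) {
    features <- setdiff(names(data), target)
    # One column of ranks per base filter, one row per feature.
    ranks <- sapply(base_filters, function(f) rank(f(data, target)[features]))
    apply(ranks, 1, aggregate_fun)
  }
}

# Two toy base filters with the same (data, target) signature.
variance_filter <- function(data, target) {
  sapply(data[setdiff(names(data), target)], var)
}
range_filter <- function(data, target) {
  sapply(data[setdiff(names(data), target)], function(x) diff(range(x)))
}

# The ensemble is called exactly like an ordinary filter.
ens <- make_filter_ensemble(median,
  list(variance = variance_filter, range = range_filter))
ens_scores <- ens(iris, "Species")
print(sort(ens_scores, decreasing = TRUE))
```

Because the returned object has the same signature as the toy base filters, calling code does not need to distinguish between a single filter and an ensemble.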

@pat-s
Member Author

pat-s commented Oct 20, 2018

Hi guys,

This is all WIP here. The main idea is to use them in makeFilterWrapper(). I'll come back to your comments later.

@pat-s
Member Author

pat-s commented Oct 21, 2018

@mb706 Thanks for your input.

Sounds like a good idea. I prefer the following notation:

filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")

@pat-s
Member Author

pat-s commented Oct 21, 2018

  • new class FilterEnsemble
  • new listFilterEnsembleMethods() etc
  • wrapper working
  • filterFeatures() and generateFilterValuesData() working

Doc, tests and more concrete examples in the next days.

@pat-s
Member Author

pat-s commented Oct 22, 2018

@larskotthoff @mb706
Looking forward to your comments now - see first post.

@larskotthoff
Member

Looks like builds are failing...

@pat-s
Member Author

pat-s commented Oct 22, 2018

Tests etc are still missing. It's more about the general approach before fixing all the details and then changing everything again.

Would be great if we could talk about the "big picture" 🙂

@@ -1,5 +1,9 @@
# mlr 2.14.0.9000

## Breaking

- Instead of a wide `data.frame` filter values are now returned in a long (tidy) `tibble`. This makes it easier to apply post-processing methods (like `group_by()`, etc) (@pat-s, #2456)
Member

I could imagine that this is actually something more people programmed against. So a breaking change here might irritate some people. Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.
Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

Member Author

> I could imagine that this is actually something more people programmed against.

This refers to returning a long DF?
Changing this again would involve several hours since I have to re-arrange all outputs.

> Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

> Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.

Yes, sure. For now it is only used for printing, not internally (i.e. the DF is coerced right before it's returned). Which ofc makes no difference for the import of tibble. I just hate printing a DF that fills my console to Inf...

Member

@jakob-r jakob-r Jun 21, 2019

> This refers to returning a long DF?

Yes

> Changing this again would involve several hours since I have to re-arrange all outputs.

Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.

> Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

Can you point me to it?

> I just hate printing a DF that fills my console to Inf...

Then just add the following to your .Rprofile. No need to add a whole package to the dependencies:

if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
  print.data.frame = function(x, ...) {
    tibble:::print.tbl(tibble::as_tibble(x), ...)
  }
}

Member Author

@pat-s pat-s Jun 21, 2019

> Just saying that it could break someone's code.

It will break code since the structure of the returned filter values is different, yes.

> It is something that could make a reverse dependency check worth it before going on CRAN.

Yes, I always do that.

> Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.

Nice hack. I'll use it :) - and then get rid of tibble in the package.

> Can you point me to it?

out = tidyr::gather(out, method, "value", !!dplyr::enquo(method))

I tried a lot of non-dplyr stuff here but eventually gave up.

Member

I updated it using melt from data.table

Member Author

🚀
so easy, damn...

@pat-s
Member Author

pat-s commented Jun 22, 2019

I assume that the ensemble filters are not using caching since I see long runtimes for them in my study. The aggregation step cannot cause this so I assume the simple filters are not being taken from the cache. Have to inspect.

@pat-s
Member Author

pat-s commented Jun 23, 2019

I made a mistake during merging (most likely) - caching was not used so far in this PR because the memoized function was not used. See 049969f. Fixed it now.
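The kind of mistake described here can be reproduced with a minimal base R sketch. This is not mlr's caching code; `memoize()`, `slow_filter()` and `cached_filter()` are hypothetical names. The point is that a cache only helps if callers go through the memoized wrapper and not through the original function.

```r
# Hypothetical stand-in for an expensive filter computation;
# `call_count` tracks how often the real work is done.
call_count <- 0
slow_filter <- function(x) {
  call_count <<- call_count + 1
  var(x)
}

# Minimal memoization: results are stored in an environment, keyed by input.
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}
cached_filter <- memoize(slow_filter)

x <- c(1, 2, 3, 4)

# Going through the memoized wrapper: the work is done once.
cached_filter(x)
cached_filter(x)
calls_with_cache <- call_count

# Calling the original function directly bypasses the cache entirely.
slow_filter(x)
slow_filter(x)
calls_without_cache <- call_count - calls_with_cache

print(c(with_cache = calls_with_cache, without_cache = calls_without_cache))
```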

I was wondering heavily why everything took so long in my project.. 🙄 🤦‍♂️

@pat-s
Member Author

pat-s commented Jun 24, 2019

@larskotthoff @jakob-r
I guess we're good for now here. If I encounter more issues along the way I'll fix them separately.

@jakob-r If you approve your review, feel free to merge.

DESCRIPTION Outdated
@@ -207,6 +208,7 @@ Suggests:
LiblineaR,
lintr (>= 1.0.0.9001),
MASS,
magrittr,
Member

Not needed.

DESCRIPTION Outdated
@@ -147,6 +147,7 @@ Imports:
ggplot2,
methods,
parallelMap (>= 1.3),
rlang,
Member

where is .data used?

Member Author

Leftover 👍 See commit.

@pat-s
Member Author

pat-s commented Jul 1, 2019

@jakob-r Mergeable now?

@pat-s
Member Author

pat-s commented Jul 19, 2019

Merging now.

@pat-s pat-s merged commit 3092400 into master Jul 19, 2019
@pat-s pat-s deleted the fs-ensemble branch July 19, 2019 13:46
vrodriguezf pushed a commit to vrodriguezf/mlr that referenced this pull request Jan 16, 2021
* wip

* save state

* fs ensemble working

* add fs ens test

* update plotFilterValues()

* new approach

* add example to filterFeatures

* tuneParams() adjustments

* tests for generateFilterValuesData() and makefilterWrapper()

* allow only passing basal.methods

* add tuneParams() checks for ensemble filters

* import dplyr and magrittr

* no cache

* suggests: tidyr

* solve dplyr::filter NS clash

* trying to fix "export of global variables" note

* example indentation

* clean example

* fix docs

* fix global variables export error, add param doc

* update generateFilterValuesData tests

* indent

* account for nselect = 0

* remove getFilterValues test

* try using SE in tidyr::gather

* fix global variables warning

* next SE attempt

* update tests

* update test

* update filters

* fix naming

* fix tests

* more examples

* adjust NS

* fix tic.R

* update test_FilterWrapper

* don't set length of fw.basal.methods

* add benchmark test

* fix filter names

* update tests

* get tutorial passing

* solve NS clash

* style

* NS and man

* remove purrr dep

* remove dplyr, purrr and magrittr imports

* fix tests

* remove browser() leftover

* fix coercin > 1 error

* fix brackets

* Deploy from Travis build 13854 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/540057376
Commit: 7b72274

* basal.methods -> base.methods

* Deploy from Travis build 13871 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/541796595
Commit: 3fad99c

* move redundant code into helper function

* reduce simple filters in benchmark test

* style

* we cannot tune the simple methods currently

* don't allow list specification of ensemble filters through `methods` argument

* document list notation for method arg in `generateFilterValuesData()`

* Deploy from Travis build 13907 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/544919859
Commit: 92b4c5f

* support all task and feature types for ensemble filters

* define ens.method in the function body

* update filter table

* Deploy from Travis build 13934 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547161246
Commit: f23070f

* fix plotFilterValues()

* Deploy from Travis build 13935 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547260271
Commit: 1f547fa

* Deploy from Travis build 13938 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547612585
Commit: e4258f6

* revert unwanted change

* add NEWS

* Deploy from Travis build 13946 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/547668906
Commit: 23bc3cd

* remove tibble

* use data.table::melt

* fix data.table::melt

* fix caching

* fix R CMD check notes, remove unused argument from makeFilterEnsemble()

* Deploy from Travis build 13971 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/549827002
Commit: 4d4dca6

* rlang and magrittr not used anymore

* fix NS

* add info how to pass filter args when using ensemble filters

* Deploy from Travis build 13993 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/551632468
Commit: e63d863

* Deploy from Travis build 14023 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/560455073
Commit: 345da05

5 participants