Filter: Add ensemble filter methods #2456
R/FilterWrapper.R
Outdated
@@ -26,12 +26,18 @@
#' Mutually exclusive with arguments `fw.perc` and `fw.abs`.
#' @param fw.mandatory.feat ([character])\cr
#' Mandatory features which are always included regardless of their scores
#' @param ensemble.method ([character])\cr
#' Which ensemble method should be used. Can only be used with >= 2 filter methods.
How exactly does this work? You can only specify one method in the wrapper, can't you?
Also why is this a character? Comments and code below suggest that this is a logical value.
No, you can use multiple methods.
> Also why is this a character? Comments and code below suggest that this is a logical value.
I'll check again. But as said, you can use multiple ones.
R/generateFilterValues.R
Outdated
#' @template arg_task
#' @param method ([character])\cr
#' Filter method(s), see above.
#' Default is \dQuote{randomForestSRC.rfsrc}.
#' @param nselect (`integer(1)`)\cr
#' Number of scores to request. Scores are getting calculated for all features per default.
#' @param ensemble.method ([character])\cr
#' Ensemble filter method to use. Can only be used with >= 2 filter methods.
Should be consistent with wrapper -- character or logical?
Nothing is finished yet :)
R/generateFilterValues.R
Outdated
### ensemble rank aggregation
if (any(c("E-min", "E-mean", "E-median", "E-max", "E-Borda") %in% ensemble.method)) {
Possible values for ensemble method should be documented.
Yep, I will ofc do this.
test_that("ensemble methods work", {
  fi = generateFilterValuesData(multiclass.task, method = c('gain.ratio','information.gain'),
    ensemble.method = c("E-Borda", "E-min"))
What does it mean if multiple ensemble methods are specified?
The same as if you would use multiple single ones. You get back a DF with all listed rankings when you use generateFilterValuesData().
Instead of putting lots of special code into the generateFilterValuesData function the ensemble stuff should probably happen somewhere else. My suggestion is that the filter code should be changed to also accept functions or
filterFeatures(pid.task, "univariate.model.score", abs = 3,
  perf.learner = "classif.logreg")
# (there should probably be a better way to access this than by ":::")
filterFeatures(pid.task, mlr:::.FilterRegister$univariate.model.score, abs = 3,
  perf.learner = "classif.logreg")
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
# alternative:
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  filter.args = list(univariate.model.score = list(perf.learn = "classif.logreg")),
  abs = 3)
In these examples, the
Hi guys, this is all WIP here. The main idea is to use them in
@mb706 Thanks for your input. Sounds like a good idea. I prefer the following notation:
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
Doc, tests and more concrete examples in the next days.
@larskotthoff @mb706
Looks like builds are failing...
Tests etc are still missing. It's more about the general approach before fixing all the details and then changing everything again. Would be great if we could talk about the "big picture" 🙂
@@ -1,5 +1,9 @@
# mlr 2.14.0.9000
## Breaking
- Instead of a wide `data.frame` filter values are now returned in a long (tidy) `tibble`. This makes it easier to apply post-processing methods (like `group_by()`, etc) (@pat-s, #2456)
I could imagine that this is actually something more people programmed against. So a breaking change here might irritate some people. Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.
Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?
> I could imagine that this is actually something more people programmed against.
This refers to returning a long DF?
Changing this again would involve several hours since I have to re-arrange all outputs.
> Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?
Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.
> Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.
Yes, sure. For now it is only used for printing, not internally (i.e. the DF is coerced right before it's returned). Which ofc makes no difference for the Import of tibble. I just hate printing a DF that fills my console to Inf...
> This refers to returning a long DF?
Yes
> Changing this again would involve several hours since I have to re-arrange all outputs.
Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.
> Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.
Can you point me to it?
> I just hate it to print a DF that fills my console to Inf...
Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.
if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
  print.data.frame = function(x, ...) {
    tibble:::print.tbl(tibble::as_tibble(x), ...)
  }
}
> Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.
It will break code since the structure of the returned filter value is different, yes.
> It is something that could make a reverse dependency check worth it before going on CRAN.
Yes, I always do that.
> Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.
Nice hack. I'll use it :) - and get rid of using tibble then in the package.
> Can you point me to it?
Line 142 in edf0142
out = tidyr::gather(out, method, "value", !!dplyr::enquo(method))
I tried a lot of non-dplyr stuff here but eventually gave up.
I updated it using melt from data.table
🚀
so easy, damn...
I assume that the ensemble filters are not using caching since I see long runtimes for them in my study. The aggregation step cannot cause this so I assume the simple filters are not being taken from the cache. Have to inspect.
I made a mistake during merging (most likely) - caching was not used so far in this PR because the memoized function was not used. See 049969f. Fixed it now. I was wondering heavily why everything took so long in my project.. 🙄 🤦♂️
@larskotthoff @jakob-r @jakob-r If you approve your review, feel free to merge.
DESCRIPTION
Outdated
@@ -207,6 +208,7 @@ Suggests:
LiblineaR,
lintr (>= 1.0.0.9001),
MASS,
magrittr,
Not needed.
DESCRIPTION
Outdated
@@ -147,6 +147,7 @@ Imports:
ggplot2,
methods,
parallelMap (>= 1.3),
rlang,
Where is `.data` used?
Leftover 👍 See commit.
@jakob-r Mergeable now?
Merging now.
* wip
* save state
* fs ensemble working
* add fs ens test
* update plotFilterValues()
* new approach
* add example to filterFeatures
* tuneParams() adjustments
* tests for generateFilterValuesData() and makefilterWrapper()
* allow only passing basal.methods
* add tuneParams() checks for ensemble filters
* import dplyr and magrittr
* no cache
* suggests: tidyr
* solve dplyr::filter NS clash
* trying to fix "export of global variables" note
* example indentation
* clean example
* fix docs
* fix global variables export error, add param doc
* update generateFilterValuesData tests
* indent
* account for nselect = 0
* remove getFilterValues test
* try using SE in tidyr::gather
* fix global variables warning
* next SE attempt
* update tests
* update test
* update filters
* fix naming
* fix tests
* more examples
* adjust NS
* fix tic.R
* update test_FilterWrapper
* don't set length of fw.basal.methods
* add benchmark test
* fix filter names
* update tests
* get tutorial passing
* solve NS clash
* style
* NS and man
* remove purrr dep
* remove dplyr, purrr and magrittr imports
* fix tests
* remove browser() leftover
* fix coercin > 1 error
* fix brackets
* Deploy from Travis build 13854 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/540057376 Commit: 7b72274
* basal.methods -> base.methods
* Deploy from Travis build 13871 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/541796595 Commit: 3fad99c
* move redundant code into helper function
* reduce simple filters in benchmark test
* style
* we cannot tune the simple methods currently
* don't allow list specification of ensemble filters through `methods` argument
* document list notation for method arg in `generateFilterValuesData()`
* Deploy from Travis build 13907 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/544919859 Commit: 92b4c5f
* support all task and feature types for ensemble filters
* define ens.method in the function body
* update filter table
* Deploy from Travis build 13934 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547161246 Commit: f23070f
* fix plotFilterValues()
* Deploy from Travis build 13935 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547260271 Commit: 1f547fa
* Deploy from Travis build 13938 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547612585 Commit: e4258f6
* revert unwanted change
* add NEWS
* Deploy from Travis build 13946 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/547668906 Commit: 23bc3cd
* remove tibble
* use data.table::melt
* fix data.table::melt
* fix caching
* fix R CMD check notes, remove unused argument from makeFilterEnsemble()
* Deploy from Travis build 13971 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/549827002 Commit: 4d4dca6
* rlang and magrittr not used anymore
* fix NS
* add info how to pass filter args when using ensemble filters
* Deploy from Travis build 13993 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/551632468 Commit: e63d863
* Deploy from Travis build 14023 [ci skip] Build URL: https://travis-ci.org/mlr-org/mlr/builds/560455073 Commit: 345da05
Purpose
In the filtering/feature selection field, ensemble filter methods are becoming increasingly popular. Ensemble filters aggregate the rankings of multiple single filters and create a new ranking.
This approach has been shown to be superior to single filter methods, e.g. https://ieeexplore.ieee.org/document/8250495.
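A minimal base-R sketch of the aggregation idea described above. It is not the code from this PR: the scores are made up, `aggregate_ranks()` is a hypothetical helper, and the Borda variant shown (points = number of features minus rank, summed over filters) is one common formulation assumed here for illustration.

```r
# Made-up filter scores: one column per simple filter, one row per feature.
scores = data.frame(
  variance   = c(0.9, 0.2, 0.5, 0.1),
  anova.test = c(30, 40, 10, 5),
  row.names  = c("f1", "f2", "f3", "f4")
)

# Rank the features within each filter; rank 1 = highest score.
ranks = sapply(scores, function(s) rank(-s, ties.method = "average"))
rownames(ranks) = rownames(scores)

# Hypothetical helper: aggregate the per-filter ranks of each feature.
aggregate_ranks = function(ranks, fun = c("mean", "median", "min", "max", "borda")) {
  fun = match.arg(fun)
  switch(fun,
    mean   = rowMeans(ranks),
    median = apply(ranks, 1, median),
    min    = apply(ranks, 1, min),
    max    = apply(ranks, 1, max),
    # Borda points: a feature earns (n - rank) points per filter; higher is better.
    borda  = rowSums(nrow(ranks) - ranks)
  )
}

ens.mean  = aggregate_ranks(ranks, "mean")   # lower aggregated rank = better
ens.borda = aggregate_ranks(ranks, "borda")  # higher Borda count = better
```

With these made-up scores, `f1` gets the best mean rank (1.5) and the highest Borda count (5), even though `anova.test` alone would have ranked `f2` first.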
Implementation
I decided to establish a new class "FilterEnsemble" which distinguishes ensemble filters from the normal single filters. This decision has (as always) positive and negative side-effects.
Ensemble filters are created in the same way as normal filters in their own file, `R/FilterEnsemble.R`. They share the same class structure with some minor differences:
- `pkg`, `supported.tasks`, `supported.features` arguments (all checked by the simple filters)
- `basal.methods` which stands for "single filter methods"
Calculation is done as usual via `generateFilterValuesData()` or `filterFeatures()`:
- A `FilterValues` object is created as usual by calling `generateFilterValuesData()` with the single filters.
- The ensemble method is then applied to this `FilterValues` object (e.g. taking the mean across all voters for each feature).
Notation
Notation differs a bit among the functions.
In `generateFilterValuesData()`, an ensemble method is passed in a list together with its required simple methods, e.g.:
To make `makeFilterWrapper()` flexible in the sense that the single methods, which an ensemble method uses, should be tunable, a new argument `base.methods` was introduced. It depends on an ensemble method set either in `filterFeatures(method = "")` or in `makeFilterWrapper(fw.method = "")`.
This gives the user the option to tune `fw.method`.
Tuning simple filters is not supported due to the lack of sampling without replacement for `DiscreteVectorParams` in ParamHelpers (mlr-org/ParamHelpers#206).
As multiple rankings are calculated and returned when using an ensemble filter, `filterFeatures()` will always prioritize the ensemble method unless a different method is set via the new `select.method` argument.
This only applies if `filterFeatures()` is called directly, as in the wrapper only one filter method is used for subsetting anyway (and in the ensemble case, the prioritizing of the ensemble method applies).
Other changes
- `getFilterValuesData()` now returns a tbl instead of a data.frame (I think there is no reason not to use enhanced data.frame output. It does not harm any internal processes.)
- `plotFilterValues()` got a bit "smarter" and easier now regarding the ordering of multiple facets
- `filterFeatures()`, `generateFilterValuesData()` and `makeFilterWrapper()` `group_by()` calls etc)
To-do
- tests
- Cache filterValues in a tuning process (and don't recalculate them all the time), see "Tuning with filters recalculates filter values each iteration" (#1995)
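The caching to-do above could be approached roughly as follows. This is a base-R sketch under stated assumptions, not the memoized function the PR ends up using: `compute_scores()` and `make_cached()` are hypothetical helpers, and the cache is keyed only by method name, which assumes the data stays the same across all calls (a real cache would also have to key on the task).

```r
# Hypothetical stand-in for an expensive simple filter; not an mlr function.
compute_scores = function(method, data) {
  switch(method,
    variance = vapply(data, var, numeric(1)),
    range    = vapply(data, function(x) diff(range(x)), numeric(1)),
    stop("unknown method: ", method)
  )
}

# Wrap a scoring function so that each method is computed at most once.
# Assumption: `data` does not change between calls.
make_cached = function(f) {
  cache = new.env(parent = emptyenv())
  function(method, data) {
    if (!exists(method, envir = cache, inherits = FALSE)) {
      assign(method, f(method, data), envir = cache)
    }
    get(method, envir = cache, inherits = FALSE)
  }
}

# Count how often the underlying filter is really evaluated.
n.calls = 0
counted_scores = function(method, data) {
  n.calls <<- n.calls + 1
  compute_scores(method, data)
}
cached_scores = make_cached(counted_scores)

data = iris[, 1:4]
for (i in 1:3) {
  # Simulates three tuning iterations that need the same filter values.
  fv = cached_scores("variance", data)
}
n.calls  # 1: the filter values were computed once and reused afterwards
```

Only the aggregation step would then have to be repeated in each tuning iteration, which is cheap compared to recomputing the simple filters.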
Examples
Created on 2018-10-22 by the reprex package (v0.2.1)