-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PipeOpFilterRows #410
base: master
Are you sure you want to change the base?
PipeOpFilterRows #410
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Also the need to filter rows during training (e.g. based on NA or some other property that a learner cannot handle) may be very different from this splitting up predictions between learners thing? |
Cleaned this up. Currently has two parameters:
I agree with @mb706 that this is somewhat redundant, because:
results in virtually the same operation being done. Maybe we should just split this up into |
That argument for 'nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the
So then dropping rows where any feature is SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any) |
So this is supposed to filter out rows during predict time, right? Is there a way to keep track of what sample is being predicted, once a learner sees the incomplete prediction dataset? So if my input task has two rows, one is dropped, my Graph returns one prediction, which sample does the prediction belong to? Suppose that is not a problem, what is the current state of Also, maybe this |
I like the idea of using |
I think we are probably mixing up different use cases. I believe what @pfistfl wanted to be possible in #301 is to simple split your task based on some logic both during training and prediction (with the row ids of the splits being disjunct and their union being equal to the row ids of the complete task). In this case using
Then we would only need another
|
Codecov Report
@@ Coverage Diff @@
## master #410 +/- ##
==========================================
- Coverage 94.46% 93.54% -0.93%
==========================================
Files 77 72 -5
Lines 2528 2245 -283
==========================================
- Hits 2388 2100 -288
- Misses 140 145 +5
Continue to review full report at Codecov.
|
Hello, I thought this new feature might help my issue #566. I gave it a try using this commit and then adding the line |
Hi @py9mrg, your approach above does not work out of the box because the selector_all_inc_target = function() {
make_selector(function(task) {
c(task$feature_names, task$target_names)
}, description = "selector_all_inc_target")
} then use this in your pipeop: po = po("filterrows",
param_vals = list(
filter_formula = ~ !apply(is.na(.SD), 1, any),
SDcols = selector_all_inc_target()
)
) this would result in the 3rd row also being dropped: po$train(list(task_w_missing))[[1]]$data()
All in all this should work for you. |
Hello @sumny , Thanks so much for coming back to me. That works! Just one thing in case anyone is reading with a similar use case, because Thanks again for your help. |
Hello @sumny, do you know when this PR might be merged? I'm using it all the time and it would be nice, when not using renv, to not have to keep rolling mlr3pipelines back to this commit every time I forget to untick the update box in RStudio for this particular package! Edit: oh I forgot to say, it's probably me, but I can only get the 84d7341 commit to recognise the new filterrows po. I can see it getting the html documentation when installing the other commits as |
@sumny thanks and don't worry - I'm sure I am completely underestimating how significant the change is. |
PipeOpFilterRows
allows to filter the rows of the data of a task according to a filter parameter and adds an easy way for removing rows with missing values, related to #301dropped this from the PRPipeOpPredictionUnion
allows to combine predictions of the sametask_type
andpredict_type
into a newPrediction
. This maybe needs some discussion as combining predictions is imo not that straightforward.