option to "flatten" backends #942

Open
mb706 opened this issue Jul 13, 2023 · 0 comments
Collaborator

mb706 commented Jul 13, 2023

Apparently we get some overhead when mlr3pipelines builds tasks with many BackendCbinds. One way to fix this would be if there were an option to "flatten" cbinded tasks.
Suggested interface:

Task$flatten(force = FALSE)  # default

This creates a task with a single DataBackendDataTable, unless that is a bad idea for some reason, e.g. when one of the backends is a database backend. (A DataBackend class would need to report whether flattening is a "bad idea", possibly through an active binding; e.g. a database backend could say flattening is okay if the size is less than X MB.)

Setting force = TRUE, on the other hand, should always flatten the task, which is equivalent to creating a new task from task$data().

Example: a TaskClassif that consists of two cbinded data.tables, which were in turn cbinded with a database backend
(abbreviating DataBackend as DB):

                TaskClassif
                  |
               DBCbind
              /       \
         DBCbind      DBDataBase
        /       \ 
 DBDataTable DBDataTable

$flatten(force = FALSE):

                TaskClassif
                  |
               DBCbind
              /       \
     DBDataTable      DBDataBase

$flatten(force = TRUE):

               TaskClassif
                  |
               DBDataTable
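
The three diagrams above can be reproduced with a small toy model. This is only a sketch of the proposed semantics in Python, not mlr3 code: the classes `Table`, `Database` and `Cbind`, the `flatten_ok` flag and the `flatten()` function are all hypothetical stand-ins for DataBackendDataTable, a database backend, DataBackendCbind and the suggested "is flattening a bad idea" report.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Toy stand-in for an in-memory backend (DBDataTable in the diagrams)."""
    columns: dict  # column name -> list of values
    flatten_ok = True  # hypothetical flag: cheap to copy into memory

    def data(self):
        return dict(self.columns)


@dataclass
class Database:
    """Toy stand-in for a database backend (DBDataBase in the diagrams)."""
    columns: dict
    flatten_ok = False  # hypothetical flag: pulling everything may be expensive

    def data(self):
        return dict(self.columns)


@dataclass
class Cbind:
    """Toy stand-in for a cbind of two backends."""
    left: object
    right: object

    @property
    def flatten_ok(self):
        # A cbind may be flattened only if both children agree.
        return self.left.flatten_ok and self.right.flatten_ok

    def data(self):
        # On a name clash the right-hand side wins in this toy model.
        return {**self.left.data(), **self.right.data()}


def flatten(backend, force=False):
    """Collapse as much of the tree as allowed into single Table nodes."""
    if force or backend.flatten_ok:
        return Table(backend.data())
    if isinstance(backend, Cbind):
        return Cbind(flatten(backend.left), flatten(backend.right))
    return backend  # a leaf that does not want to be flattened


tree = Cbind(
    Cbind(Table({"a": [1, 2]}), Table({"b": [3, 4]})),
    Database({"c": [5, 6]}),
)
soft = flatten(tree)              # Cbind(Table, Database)
hard = flatten(tree, force=True)  # a single Table
```

With `force=False` the two in-memory leaves are merged and the database leaf is left alone; with `force=True` everything ends up in one table, which mirrors building a new task from task$data().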

We should also consider whether mlr3pipelines should do this for all of its output tasks by default.

Another question is whether that should be an in-place operation that swaps out a task's data backend, or whether this should create a new task.

Another question is what to do with columns that do not have any column role. Maybe a good default would be to drop backends that do not provide columns that have a role (and are therefore ignored in many cases).
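
The "drop backends without role columns" default could look roughly like the following. This is a Python sketch under assumptions, not mlr3 code: `prune_sources` and both dictionaries are hypothetical, with `col_roles` modelled as a mapping from role name to column names.

```python
def prune_sources(sources, col_roles):
    """Keep only the sources that provide at least one column with a role.

    sources:   dict mapping a source name to the column names it provides
    col_roles: dict mapping a role name to the column names holding that role
    """
    used = {col for cols in col_roles.values() for col in cols}
    return {
        name: cols
        for name, cols in sources.items()
        if used.intersection(cols)
    }


sources = {
    "features": ["x1", "x2"],
    "labels": ["y"],
    "extra": ["debug_id"],  # no column here has a role
}
col_roles = {"feature": ["x1", "x2"], "target": ["y"]}

kept = prune_sources(sources, col_roles)  # "extra" is dropped
```

A source is kept as a whole as soon as one of its columns has a role; whether the unused columns of a kept source should also be dropped is a separate decision that this sketch does not make.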

Maybe we would want to have a DataBackendMultiCBind that can cbind multiple sources, so that even a task with many different database backends is at most one level deep after flattening. The $flatten(force = FALSE) operation would have to check, for each column, whether it comes from a data backend that reports it does not want to be flattened. There should be a method in DataBackend that does this recursively. $flatten() would then construct the desired DataBackendMultiCBind.
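
The recursive per-column check might be sketched as follows. Again this is a Python toy model under assumptions, not mlr3 code: `Leaf`, `Cbind`, `column_sources` and `plan_multi_cbind` are hypothetical names, leaf names are assumed to be unique, and the result is only a plan of which columns would go where in a one-level DataBackendMultiCBind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Toy leaf backend; flatten_ok is the hypothetical 'may be flattened' report."""
    name: str
    columns: tuple
    flatten_ok: bool = True


@dataclass(frozen=True)
class Cbind:
    """Toy cbind of two backends."""
    left: object
    right: object


def column_sources(backend):
    """Recursively map each visible column to the leaf that provides it."""
    if isinstance(backend, Leaf):
        return {col: backend for col in backend.columns}
    # On a name clash the right-hand source shadows the left one in this model.
    return {**column_sources(backend.left), **column_sources(backend.right)}


def plan_multi_cbind(backend):
    """Split columns into one merged in-memory table and untouched sources.

    Returns (merged, kept): merged lists the columns that would be copied into
    a single in-memory table, kept maps each non-flattenable leaf name to the
    columns still read from it.
    """
    merged, kept = [], {}
    for col, leaf in column_sources(backend).items():
        if leaf.flatten_ok:
            merged.append(col)
        else:
            kept.setdefault(leaf.name, []).append(col)
    return merged, kept


tree = Cbind(
    Cbind(Leaf("t1", ("a",)), Leaf("db1", ("b",), flatten_ok=False)),
    Cbind(Leaf("t2", ("c",)), Leaf("db2", ("d", "e"), flatten_ok=False)),
)
merged, kept = plan_multi_cbind(tree)
```

However deep the original tree is, the plan has only one level: one merged table plus one entry per source that declined to be flattened.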
