
New imputation framework #75

Open
alexkowa opened this issue Nov 10, 2023 · 2 comments

@alexkowa (Member)

Looking at #73 and #74, maybe what we should really do is consolidate this into a new function and deprecate some of the existing ones (irmi, rangerImpute and regressionImpute)?
@matthias-da @GregorDeCillia @JohannesGuss

# model-based imputation framework function
# general idea: a function that provides a framework for model-based imputation.
# Options should be
# - sequential modelling
# - using PMM or not
# - bootstrapping the model error
# - drawing the predicted value from the "posterior" distribution
# - model options: (robust) regression, ranger, XGBoost, some kind of transformer model


vimpute <- function(data,
                    variable = colnames(data),
                    sequential = FALSE,
                    bootstrap = FALSE,
                    pmm_k = NULL, # if an integer value, use kNN on the predicted values
                    xvar = colnames(data),
                    model = c("robust", "regression", "ranger", "XGBoost", "GPT"),
                    formula = NULL, # possibility to override the individual models
                    imp_var = TRUE, imp_suffix = "imp",
                    verbose = FALSE) {

}
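To illustrate the "using PMM or not" option above, here is a minimal sketch of predictive mean matching with k donors (the `pmm_k` idea): a model produces predictions for observed and missing rows, and each missing value is replaced by the observed value of one of the k nearest predicted neighbours. Function and argument names are illustrative, not the proposed API.

```r
# Hypothetical PMM helper: pred_obs/pred_mis are model predictions for rows
# with observed/missing y; y_obs are the observed values acting as donors.
pmm_impute <- function(pred_obs, pred_mis, y_obs, k = 5) {
  vapply(pred_mis, function(p) {
    # k donors with predictions closest to the prediction for the missing row
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(pred_obs)))]
    # draw one donor at random and return its observed value
    y_obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}
```

With `k = 1` this degenerates to nearest-prediction hot-deck imputation; larger k injects imputation uncertainty.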
@matthias-da (Collaborator)

Yeah, it would be great to discuss this.

There is some difficulty/complexity when doing sequential imputation with different kinds of variables, such as numeric and categorical. Sometimes other methods (or parametrisations) are used depending on whether a variable is categorical or numeric (and in irmi we also consider semi-continuous and count variables). So, as in mice::mice, there must then be default methods for the different kinds of variables in a data set.
The argument method can then be either a single string or a vector of strings, one for each variable.
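The per-variable defaults could be resolved along these lines. This is a sketch only; the method names (`"logreg"`, `"polyreg"` borrowed from mice's naming, `"lm"` as a stand-in numeric default) and the function name are assumptions, not the final vimpute() design.

```r
# Hypothetical default-method resolver: one method string per column,
# chosen from the column's class, which the user can then override.
default_method <- function(data) {
  vapply(data, function(x) {
    if (is.factor(x) && nlevels(x) == 2) "logreg"   # binary factor
    else if (is.factor(x))               "polyreg"  # multi-level factor
    else                                 "lm"       # numeric fallback
  }, character(1))
}

d <- data.frame(y = rnorm(5),
                g = factor(c("a", "b", "a", "b", "a")))
default_method(d)
```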

When bootstrapping, one has to ensure that all categories of a factor variable are actually sampled, since certain imputation methods otherwise error. A Bayesian bootstrap as a way out, or tricking the factor levels?
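The Bayesian bootstrap mentioned as a way out could look like this: instead of resampling rows (which can drop whole factor levels), draw Dirichlet(1, …, 1) observation weights, so every row, and hence every factor level, keeps positive weight in the fitted model. A sketch under these assumptions:

```r
# Dirichlet(1, ..., 1) weights via normalised Exp(1) draws: every observation
# gets a strictly positive weight, so no factor level can vanish.
bayesian_bootstrap_weights <- function(n) {
  g <- rexp(n)
  g / sum(g)
}

w <- bayesian_bootstrap_weights(10)
# the weights can then be passed on, e.g. lm(y ~ x, data, weights = w)
```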

I like the idea of a robust bootstrap: if the model is robust, one should also take care that not considerably more outliers are accidentally sampled than in the original data. However, it is not straightforward to use with a mix of differently scaled variables. The idea is to divide the observations into strata depending on their "outlyingness" and sample from each stratum independently.
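The stratified idea can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the outlyingness score is taken as given (e.g. robust distances), and strata are equal-width bins of the score ranks.

```r
# Bootstrap within outlyingness strata, so the resample cannot contain
# substantially more outliers than the original data.
stratified_bootstrap <- function(data, score, n_strata = 3) {
  # bin observations by the rank of their outlyingness score
  strata <- cut(rank(score), breaks = n_strata, labels = FALSE)
  # resample with replacement inside each stratum only
  idx <- unlist(lapply(split(seq_len(nrow(data)), strata), function(i) {
    i[sample.int(length(i), length(i), replace = TRUE)]
  }))
  data[idx, , drop = FALSE]
}
```

Each stratum contributes exactly as many rows to the resample as it holds in the original data, which is what keeps the outlier share stable.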

# model-based imputation framework function
# general idea: a function that provides a framework for model-based imputation.
# Options should be
# - sequential modelling
# - using PMM or not
# - bootstrapping the model error
# - drawing the predicted value from the "posterior" distribution
# - model options: (robust) regression, ranger, XGBoost, some kind of transformer model
# - complex formula (with `formula`) possible for each variable


vimpute <- function(data,
                    variable = colnames(data),
                    sequential = TRUE,
                    # how to best deal with each method having its own parameters?
                    modeluncertainty = c("robustBootstrap", "none", "bootstrap",
                                         "robustBootstrap-stratified",
                                         "robustBootstrap-xyz", "BayesianBootstrap"),
                    imputationuncertainty = c("PMM", "midastouch", "normal", "residual"),
                    pmm_k = NULL, # if an integer value, use kNN on the predicted values;
                                  # either of length one or of length number of variables?
                    xvar = colnames(data), # delete this?
                    # here I would use default methods for each kind of variable - as in
                    # mice - that one can override. Supported methods: "lm", "MM",
                    # "ranger", "XGBoost", "GPT", "gam", "robGam"
                    method = c("lm", "MM", "ranger", "XGBoost", "GPT", "gam", "robGam"),
                    formula = NULL, # possibility to override the individual models;
                                    # a list (one formula per variable)
                    imp_var = FALSE, imp_suffix = "imp",
                    verbose = FALSE) {
}

So all in all, the real pain is packing everything into a sequential approach when variables are of different scales.

@GregorDeCillia (Contributor) commented Nov 10, 2023

Just a technical note: deprecating the low-level functions (irmi, rangerImpute and regressionImpute) is not really necessary. We could just present the new high-level vimpute() in the docs as the "recommended way" instead. The advantage is that the low-level functions keep their own man pages, which can go into detail about how each specific algorithm works and document parameters that only apply to that method (using ... to pass them down from vimpute()).
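The wrapper-plus-`...` design could be sketched like this. The `*_stub` functions below are placeholders standing in for the real low-level functions (whose actual signatures are not assumed here); the point is only that vimpute() forwards method-specific arguments untouched.

```r
# Placeholder low-level functions; the real irmi/rangerImpute/regressionImpute
# keep their own man pages and method-specific parameters.
regressionImpute_stub <- function(data, ...) list(data = data, extra = list(...))
rangerImpute_stub     <- function(data, ...) list(data = data, extra = list(...))

# Thin high-level wrapper: dispatch on method, pass everything else down.
vimpute <- function(data, method = c("regression", "ranger"), ...) {
  method <- match.arg(method)
  switch(method,
         regression = regressionImpute_stub(data, ...),
         ranger     = rangerImpute_stub(data, ...))
}

res <- vimpute(data.frame(x = 1:3), "regression", robust = TRUE)
```

Method-specific options then only need documenting once, on the low-level function's man page.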
