Hyperimpute length mismatch #41

@preritt

Question

Length mismatch error

Further Information

I am trying to use hyperimpute on my own dataset, with the following setup:

from hyperimpute.plugins.imputers import Imputers

method = "hyperimpute"
plugin = Imputers().get(
    method,
    optimizer="hyperband",
    classifier_seed=["logistic_regression", "catboost", "xgboost", "random_forest"],
    regression_seed=[
        "linear_regression",
        "catboost_regressor",
        "xgboost_regressor",
        "random_forest_regressor",
    ],
    # class_threshold: int. Maximum number of unique values a column may have to be treated as categorical.
    class_threshold=5,
    # imputation_order: int. 0 - ascending, 1 - descending, 2 - random
    imputation_order=2,
    # n_inner_iter: int. Number of imputation iterations.
    n_inner_iter=10,
    # select_model_by_column: bool. If true, select a different model for each column. Else, reuse the model chosen for the first column.
    select_model_by_column=True,
    # select_model_by_iteration: bool. If true, select new models for each iteration. Else, reuse the models chosen in the first iteration.
    select_model_by_iteration=True,
    # select_lazy: bool. If false, start the optimizer on every column unless other restrictions apply. Else, if there is a trend in the current iteration (at least two columns of the same type got the same model from the optimizer), reuse the same model class for all the columns without starting the optimizer.
    select_lazy=True,
    # select_patience: int. How many iterations without objective function improvement to wait.
    select_patience=5,
)
# fit it on the data
plugin.fit(traindataSelected.copy())
# predict the missing values
predictedval = plugin.transform(traindataSelected.copy())

My training data has 1000 rows and 372 columns. When I run the code above, I get the following error:

---> [78] predictedval = plugin.transform(traindataSelected.copy())

ValueError: Length mismatch: Expected axis has 368 elements, new values have 372 elements

Can you please let me know if I am missing something or the reason for the error? Is there a way to manually specify which columns should be considered continuous and which ones should be treated as discrete?
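
To see how a unique-count threshold such as `class_threshold=5` would divide the columns, the counts can be inspected directly with pandas. This is only a sketch: the helper `split_by_unique_count` and the demo frame are hypothetical, and the inclusive comparison (`<=`) is an assumption about how the threshold is applied, not hyperimpute's confirmed rule.

```python
import numpy as np
import pandas as pd


def split_by_unique_count(df, class_threshold=5):
    """Hypothetical helper: group columns by their number of distinct observed values.

    Assumption: a column counts as categorical when it has at most
    `class_threshold` unique non-missing values.
    """
    n_unique = df.nunique(dropna=True)  # missing values are not counted
    categorical = n_unique[n_unique <= class_threshold].index.tolist()
    continuous = n_unique[n_unique > class_threshold].index.tolist()
    return categorical, continuous


# Small illustrative frame (made-up data, not the dataset from this issue).
demo = pd.DataFrame(
    {
        "flag": [0, 1, np.nan, 1, 0, 1, 0, 1],
        "grade": [1, 2, 3, 4, 5, 1, 2, np.nan],
        "score": [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9],
    }
)
categorical, continuous = split_by_unique_count(demo, class_threshold=5)
print("categorical:", categorical)
print("continuous:", continuous)
```

Running the same helper on `traindataSelected` would show which of the 372 columns fall on each side of the threshold under that assumption.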

Even when I use the mean imputer, the imputed output has 368 columns while my original data has 372 columns.

method = "mean"
plugin = Imputers().get(method)
# fit it on the data
plugin.fit(X.copy())
# predict the missing values
predictedval = plugin.transform(X.copy())
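
One possible explanation, not confirmed in this thread, is that four of the columns contain no observed values at all. scikit-learn's `SimpleImputer` discards such columns at transform time for the mean strategy by default; whether the plugins here rely on it is an assumption. A minimal pandas check, with the hypothetical helper `all_missing_columns` and made-up demo data:

```python
import numpy as np
import pandas as pd


def all_missing_columns(df):
    """Hypothetical helper: return the names of columns in which every value is missing."""
    mask = df.isna().all(axis=0)
    return mask[mask].index.tolist()


# Illustrative frame: column "b" has no observed values.
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, np.nan],
        "c": [np.nan, 2.0, np.nan],
    }
)
empty_cols = all_missing_columns(demo)
print("columns with no observed values:", empty_cols)

# Dropping them before fitting keeps input and output column counts aligned.
cleaned = demo.drop(columns=empty_cols)
print(cleaned.shape)
```

If `all_missing_columns(X)` returns exactly four names, that would account for the 372 versus 368 difference.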

Thanks!
