Description
Describe the bug
I'm calling some of the regression methods provided in auto-sklearn for my project, and an error is raised at the predict stage when using mlp, libsvm_svr, or sgd. The exact traceback is below (the printed 1D array is omitted):
~/anaconda3/lib/python3.8/site-packages/autosklearn/pipeline/components/regression/libsvm_svr.py in predict(self, X)
100 raise NotImplementedError
101 Y_pred = self.estimator.predict(X)
--> 102 return self.scaler.inverse_transform(Y_pred)
103
104 @staticmethod
~/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_data.py in inverse_transform(self, X, copy)
1014
1015 copy = copy if copy is not None else self.copy
-> 1016 X = check_array(
1017 X,
1018 accept_sparse="csr",
~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
759 # If input is 1D raise error
760 if array.ndim == 1:
--> 761 raise ValueError(
762 "Expected 2D array, got 1D array instead:\narray={}.\n"
763 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
The same error is raised for autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py.
To Reproduce
Test data: https://www.kaggle.com/tejashvi14/medical-insurance-premium-prediction/download
Use "PremiumPrice" as the response (y) and the other variables as features (X).
- Call the three models above with a fit/predict workflow. The error above appears at the predict stage.
- Alternatively, use AutoSklearnRegressor.
Fit stage (the time limit is only there to save time; I don't expect it to return anything meaningful):
from autosklearn.regression import AutoSklearnRegressor
reg = AutoSklearnRegressor(
    time_left_for_this_task=360,
    include={'regressor': ['mlp']},
)
reg.fit(data[features], data[[response]])
Predict stage
# predict takes only the features; a second positional argument would be read as batch_size
reg.predict(data[features])
The training stage emits a very large number of warnings such as:
[WARNING] [2021-11-09 15:14:31,628:Client-AutoMLSMBO(1)::079213e7-41a2-11ec-97c8-00155d1712a6] Configuration 119 not found
(with different numbers in place of 119).
With AutoSklearnRegressor, predict returns an (n_samples,) numpy array whose elements are all identical (close to the mean of the response, but not exactly equal to it), which I don't think is the intended behavior.
Output of the predict stage (only the first few lines are shown; the rest are identical):
array([24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
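A quick way to confirm that every prediction is identical, rather than eyeballing the printed array, is sketched below. The helper name looks_constant is hypothetical, and the sample array simply repeats the value shown above.

```python
import numpy as np

def looks_constant(pred):
    """Return True if every element of a non-empty prediction array is equal."""
    pred = np.asarray(pred, dtype=float).ravel()
    return pred.size > 0 and bool(pred.max() == pred.min())

# Stand-in for the array returned by reg.predict(...) in this report.
sample = np.full(8, 24110.60546875)
print(looks_constant(sample))
```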
Reason for the Problem
I think the problem is caused by the standardization (sklearn.preprocessing.StandardScaler) used in autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py.
The code below is extracted from autosklearn/pipeline/components/regression/sgd.py, iterative_fit, lines 92-95:
self.scaler = sklearn.preprocessing.StandardScaler(copy=True)
self.scaler.fit(y.reshape((-1, 1)))
Y_scaled = self.scaler.transform(y.reshape((-1, 1))).ravel()
self.estimator.fit(X, Y_scaled)
And in the predict method, lines 131-132:
Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred)
Y_pred, as returned by the estimator's predict method, is an (n_samples,) numpy array, while StandardScaler.inverse_transform requires an (n_samples, 1) array. The correction should be something like:
Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred.reshape(-1, 1)).ravel()
I think mlp/libsvm_svr have the same problem.
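To check the proposed correction end to end, here is a minimal sketch of the same pattern outside auto-sklearn: scale y in fit, undo the scaling in predict with the reshape. The class ScaledTargetRegressor is a hypothetical stand-in, not the auto-sklearn component, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

class ScaledTargetRegressor:
    """Hypothetical stand-in: standardize y in fit, invert it in predict."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.scaler = None

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.scaler = StandardScaler(copy=True)
        # StandardScaler works on 2D input, so reshape to (n_samples, 1) and flatten back.
        y_scaled = self.scaler.fit_transform(y.reshape(-1, 1)).ravel()
        self.estimator.fit(X, y_scaled)
        return self

    def predict(self, X):
        y_pred = self.estimator.predict(X)  # shape (n_samples,)
        # Proposed fix: reshape to 2D before inverse_transform, then flatten.
        return self.scaler.inverse_transform(y_pred.reshape(-1, 1)).ravel()

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 24000.0 + 5000.0 * X[:, 0] + rng.normal(scale=100.0, size=200)

model = ScaledTargetRegressor(SGDRegressor(random_state=0)).fit(X, y)
pred = model.predict(X)
print(pred.shape)
```

With the reshape in place, predict returns a 1D array on the original scale of y instead of raising the ValueError.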
Environment and installation:
- OS: Windows 11 Education, OS build 22000.282, WSL version 2 with Ubuntu 20.04.3 LTS (run on WSL)
- Conda version: 4.10.3
- Python version: 3.8.8
- Sklearn version: 1.0.1
- Auto-sklearn version: 0.14.0