-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX Fixes encoders for string dtypes #15763
FIX Fixes encoders for string dtypes #15763
Conversation
Does this imply that we should introduce similar support elsewhere, e.g.
SimpleImputer (and for categoricals in trees and Gower distances when they
come)? Should we instead aim to support one right way of doing it, and
raise a better error message?
|
Not supporting unicode and forcing users to use objects in their string processing seems a bit not user friendly. As seen in #15616 (comment), let's say we improve the error message: ohe = OneHotEncoder(categories=[["cat2", "cat1"]])
ohe.fit([["cat2"], ["cat1"]]) # Works as expected
ohe.fit(np.array([["cat2"], ["cat1"]]))
# Raises "Either pass a python list or cast numpy array to object"? In the other use case: #15726 (comment) encoder = OneHotEncoder(categories=[['mse','mae']], sparse=False)
categories = np.array([['mse'], ['mae']], dtype='object')
encoder.fit(categories)
values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)
# Raises "Either pass a python list or cast numpy array to object"? In both cases, this seems not user friendly. From a maintainable point of view, it is most likely best to have error and only support |
I have not followed the maintainability related discussions on supporting those so far, but for a few character strings, there is a ~50bytes overhead per PyObject containing a string. So a columns of string takes 3-5x more memory as object as opposed to numpy array of dtype However I imagine most users would use pandas dataframe as input, which for now doesn't support memory efficient (e.g. numpy) string arrays pandas-dev/pandas#8640 Also using pd.Categorical does solve some of the same issues, with the added bonus of additional functionality. So maybe focusing on supporting Though the fix here is straightforward enough and should be too problematic to maintain I think. |
Maybe we just need some tools to help us generate and test each of the cases |
After speaking to @amueller, I think this is not the right fix because of how weird unicode can be: x = np.array(['a', 'b'])
x[0] = 'hello'
x
# array(['h', 'b'], dtype='<U1') In these cases we should convert the unicode dtypes to objects. |
Except if it doesn't fit in memory while array of string does. But yes string array behavior can be potentially dangerous to unaware users. |
For imputing, because of the string array behavior, it would be very unexpected to impute a string value and find that some of it is cut off. Thanking about this more, for the encoders, it never modifies the original array, so this PR should be okay. |
So if we start supporting these types, we should make it explicit in the documentation, even more, if we cannot support an array of unicode string in all estimators (data in imputers, column name in ColumnTransformer) |
Please add a what's new entry, @thomasjpfan |
@thomasjpfan , one approval already: do you think this is worth to be in 0.24? |
I think it would be nice to have because it resolves two issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@thomasjpfan Could you solve the merge conflicts as well. |
I want just to illustrate that we would not be able to use this feature in a import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import FunctionTransformer
np.random.seed(0)
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
categorical_features = ['embarked', 'sex', 'pclass']
X = X[categorical_features].values.astype(str)
categorical_features = [0, 1, 2]
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('converter', FunctionTransformer(func=func)),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train) ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-03641e4360e7> in <module>
33 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
34
---> 35 clf.fit(X_train, y_train)
~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit(self, X, y, **fit_params)
335 """
336 fit_params_steps = self._check_fit_params(**fit_params)
--> 337 Xt = self._fit(X, y, **fit_params_steps)
338 with _print_elapsed_time('Pipeline',
339 self._log_message(len(self.steps) - 1)):
~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
297 cloned_transformer = clone(transformer)
298 # Fit or load from cache the current transformer
--> 299 X, fitted_transformer = fit_transform_one_cached(
300 cloned_transformer, X, y, None,
301 message_clsname='Pipeline',
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
749 with _print_elapsed_time(message_clsname, message):
750 if hasattr(transformer, 'fit_transform'):
--> 751 res = transformer.fit_transform(X, y, **fit_params)
752 else:
753 res = transformer.fit(X, y, **fit_params).transform(X)
~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
536 self._validate_remainder(X)
537
--> 538 result = self._fit_transform(X, y, _fit_transform_one)
539
540 if not result:
~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
463 self._iter(fitted=fitted, replace_strings=True))
464 try:
--> 465 return Parallel(n_jobs=self.n_jobs)(
466 delayed(func)(
467 transformer=clone(trans) if not fitted else trans,
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1027 # remaining jobs.
1028 self._iterating = False
-> 1029 if self.dispatch_one_batch(iterator):
1030 self._iterating = self._original_iterator is not None
1031
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
845 return False
846 else:
--> 847 self._dispatch(tasks)
848 return True
849
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
763 with self._lock:
764 job_idx = len(self._jobs)
--> 765 job = self._backend.apply_async(batch, callback=cb)
766 # A job can complete so quickly than its callback is
767 # called before we get here, causing self._jobs to
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
250 # change the default number of processes to -1
251 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252 return [func(*args, **kwargs)
253 for func, args, kwargs in self.items]
254
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
250 # change the default number of processes to -1
251 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252 return [func(*args, **kwargs)
253 for func, args, kwargs in self.items]
254
~/Documents/packages/scikit-learn/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
220 def __call__(self, *args, **kwargs):
221 with config_context(**self.config):
--> 222 return self.function(*args, **kwargs)
~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
749 with _print_elapsed_time(message_clsname, message):
750 if hasattr(transformer, 'fit_transform'):
--> 751 res = transformer.fit_transform(X, y, **fit_params)
752 else:
753 res = transformer.fit(X, y, **fit_params).transform(X)
~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
372 """
373 fit_params_steps = self._check_fit_params(**fit_params)
--> 374 Xt = self._fit(X, y, **fit_params_steps)
375
376 last_step = self._final_estimator
~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
297 cloned_transformer = clone(transformer)
298 # Fit or load from cache the current transformer
--> 299 X, fitted_transformer = fit_transform_one_cached(
300 cloned_transformer, X, y, None,
301 message_clsname='Pipeline',
~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
749 with _print_elapsed_time(message_clsname, message):
750 if hasattr(transformer, 'fit_transform'):
--> 751 res = transformer.fit_transform(X, y, **fit_params)
752 else:
753 res = transformer.fit(X, y, **fit_params).transform(X)
~/Documents/packages/scikit-learn/sklearn/base.py in fit_transform(self, X, y, **fit_params)
718 else:
719 # fit method of arity 2 (supervised transformation)
--> 720 return self.fit(X, y, **fit_params).transform(X)
721
722
~/Documents/packages/scikit-learn/sklearn/impute/_base.py in fit(self, X, y)
282 self : SimpleImputer
283 """
--> 284 X = self._validate_input(X, in_fit=True)
285
286 # default fill_value is 0 for numerical input and "missing_value"
~/Documents/packages/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
260 _check_inputs_dtype(X, self.missing_values)
261 if X.dtype.kind not in ("i", "u", "f", "O"):
--> 262 raise ValueError("SimpleImputer does not support data with dtype "
263 "{0}. Please provide either a numeric array (with"
264 " a floating point or integer dtype) or "
ValueError: SimpleImputer does not support data with dtype <U6. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype. |
So it might be worth to merge this but we need to make all estimators supporting this dtype. |
@rth said above:
Actually, the numpy array does not store a Python object but a reference to a Python object. For categorical variables where there is a lot of repetition this will typically be only 8 bytes per entry (+ one PyObject per unique category, outside of the array itself). This is larger than >>> import numpy as np
>>> X_obj = np.tile(np.array(["a", "b", "c"], dtype=object), int(1e6))
>>> X_obj.nbytes / 1e6
24.0
>>> X_obj.itemsize
8
>>> X_str = X_obj.astype(str)
>>> X_str.dtype
dtype('<U1')
>>> X_str.nbytes / 1e6
12.0
>>> X_str.itemsize
4 So as long as the cardinality is limited, using dtype=object for categorical variables is almost always more memory efficient than unicode (or even bytes) string dtypes. However for free text data (e.g. first name / last names, street names, titles, descriptions...) where the entries are rarely or never repeated, then the unicode string might make sense as long as the length is homogeneous. If the length of one entry is 10x larger than the median length, then the padding incurred by the unicode string dtypes will also blow up the memory usage and object dtype would again be more memory efficient. The only practical case where the numpy unicode / bytes string dtype could be valid would be to store fixed length string identifiers or hash digests in hexadecimal notation (as the hash size is typically larger than 64 bits). But those are typically useless variables in a machine learning context (because of the lack of repetition). pandas 1.0 has an experimental variable length That being said we could still change |
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should definitely include a fix for the very confusing #15726. As unicode strings were not explicitly rejected in 0.23, let us do this quickfix which is a net improvement, even if the full scikit-learn code base (e.g. ColumnTransfomer
) does not accept unicode strings consistently.
We can always decide later whether we want to add support for unicode string support to ColumnTransfomer
for consistency or deprecate unicode string support in categorical values encoders in light of my comment above on dtypes. If we want better memory efficiency we might consider a deeper integration with pandas categorical but that sounds like a potentially significant maintenance effort.
I will fix the linting failure I caused by resolving the conflicts using the online github UI... |
There is a more fundamental problem in my merge commit. I am investigating. |
Merged. Thanks all! |
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Fixes #15616
Fixes #15726
What does this implement/fix? Explain your changes.
Adds support for string dtypes in the
Encoders
.Any other comments?
This PR opens up
'U'
dtypes to be encoded the same way as'O'
dtypes.