doc/faq.rst: 19 additions & 15 deletions
@@ -31,26 +31,30 @@ General
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-Supported formats for these training and testing pairs are: np.ndarray,
-pd.DataFrame, scipy.sparse.csr_matrix and python lists.

-If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
-data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
-for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
-for multidimensional data.
-
-Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
+Regarding the features, there are multiple things to consider:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
-dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
-column has a categorical/boolean class, it will be encoded. If the column is of any other type
-(Object or Timeseries), an error will be raised. For further details on how to properly encode
-your data, you can check the Pandas Example
-`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_).
-If you are working with time series, it is recommended that you follow this approach
+* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
+supports both categorical or string as column type. Please ensure that you are using the correct
+dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
+encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
+* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
+Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
+* For further details on how to properly encode your data, you can check the Pandas Example
+`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
 `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
+* If you prefer not using the string option at all you can disable this option. In this case
+objects, strings and categorical columns are encoded as categorical.
doc/manual.rst: 2 additions & 4 deletions
@@ -317,20 +317,18 @@ Other
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-Supported formats for these training and testing pairs are: np.ndarray,
-pd.DataFrame, scipy.sparse.csr_matrix and python lists.

 Regarding the features, there are multiple things to consider:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
+* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
 dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
 supports both categorical or string as column type. Please ensure that you are using the correct
 dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
 encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
 * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
 for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
 * For further details on how to properly encode your data, you can check the Pandas Example
 `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach