Commit 09c4d22

mfeurer authored and eddiebergman committed
Update FAQ with text stuff (#1500)
* Update FAQ with text stuff
* Take suggestions into account
1 parent 56e6ac0 commit 09c4d22

2 files changed

+21
-19
lines changed

doc/faq.rst

Lines changed: 19 additions & 15 deletions
@@ -31,26 +31,30 @@ General
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-Supported formats for these training and testing pairs are: np.ndarray,
-pd.DataFrame, scipy.sparse.csr_matrix and python lists.

-If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
-data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
-for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
-for multidimensional data.
-
-Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
+Regarding the features, there are multiple things to consider:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
-dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
-column has a categorical/boolean class, it will be encoded. If the column is of any other type
-(Object or Timeseries), an error will be raised. For further details on how to properly encode
-your data, you can check the Pandas Example
-`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_).
-If you are working with time series, it is recommended that you follow this approach
+* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
+supports both categorical or string as column type. Please ensure that you are using the correct
+dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
+encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
+* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
+Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
+* For further details on how to properly encode your data, you can check the Pandas Example
+`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
 `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
+* If you prefer not using the string option at all you can disable this option. In this case
+objects, strings and categorical columns are encoded as categorical.
+
+.. code:: python
+
+import autosklearn.classification
+automl = autosklearn.classification.AutoSklearnClassifier(allow_string_features=False)
+automl.fit(X_train, y_train)

 Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be
 automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding

doc/manual.rst

Lines changed: 2 additions & 4 deletions
@@ -317,20 +317,18 @@ Other
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-Supported formats for these training and testing pairs are: np.ndarray,
-pd.DataFrame, scipy.sparse.csr_matrix and python lists.

 Regarding the features, there are multiple things to consider:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
+* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
 dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
 supports both categorical or string as column type. Please ensure that you are using the correct
 dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
 encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
 * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
 for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
 * For further details on how to properly encode your data, you can check the Pandas Example
 `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
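
The updated text asks users to label categorical columns explicitly and notes that object and string columns are treated as text by default. A minimal sketch of what that labeling might look like in pandas follows; the toy DataFrame, its column names, and its values are hypothetical and not part of the commit.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
X_train = pd.DataFrame(
    {
        "age": [23, 35, 41, 29],
        "city": ["Berlin", "Paris", "Berlin", "Rome"],
        "review": ["great product", "too slow", "works fine", "would buy again"],
    }
)

# Explicitly label the low-cardinality column as categorical so it is
# handled as a category rather than as free text.
X_train["city"] = X_train["city"].astype("category")

# Mark the free-text column with the pandas string dtype.
X_train["review"] = X_train["review"].astype("string")

# The numerical column keeps its numeric dtype and needs no relabeling.
print(X_train.dtypes)
```

With the dtypes set this way, the DataFrame could presumably be passed to `automl.fit(X_train, y_train)` as in the snippet added to doc/faq.rst.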
