Doc: Adds documentation for the dataset compression argument from #1341 and #1250 (#1386)


Merged: 3 commits, merged Feb 3, 2022

Changes from all commits
155 changes: 68 additions & 87 deletions autosklearn/estimators.py
@@ -156,58 +156,41 @@ def __init__(
'feature_preprocessor': ["no_preprocessing"]
}

-resampling_strategy : Union[str, BaseCrossValidator, _RepeatedSplits, BaseShuffleSplit] = "holdout"
+resampling_strategy : str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
How to handle overfitting; you might need to use ``resampling_strategy_arguments``
if using a ``"cv"`` based method or a Splitter object.

+* **Options**
+    * ``"holdout"`` - Use a 67:33 (train:test) split
+    * ``"cv"`` - Perform cross-validation, requires "folds" in ``resampling_strategy_arguments``
+    * ``"holdout-iterative-fit"`` - Same as "holdout" but iterative fit where possible
+    * ``"cv-iterative-fit"`` - Same as "cv" but iterative fit where possible
+    * ``"partial-cv"`` - Same as "cv" but uses intensification
+    * ``BaseCrossValidator`` - any BaseCrossValidator subclass (found in scikit-learn model_selection module)
+    * ``_RepeatedSplits`` - any _RepeatedSplits subclass (found in scikit-learn model_selection module)
+    * ``BaseShuffleSplit`` - any BaseShuffleSplit subclass (found in scikit-learn model_selection module)

+If using a Splitter object that relies on the dataset retaining its current
+size and order, you will need to look at the ``dataset_compression`` argument
+and ensure that ``"subsample"`` is not included in the applied compression
+``"methods"``, or disable it entirely with ``False``.

-**Options**
-
-* ``"holdout"``:
-    67:33 (train:test) split
-* ``"holdout-iterative-fit"``:
-    67:33 (train:test) split, iterative fit where possible
-* ``"cv"``:
-    crossvalidation,
-    requires ``"folds"`` in ``resampling_strategy_arguments``
-* ``"cv-iterative-fit"``:
-    crossvalidation,
-    calls iterative fit where possible,
-    requires ``"folds"`` in ``resampling_strategy_arguments``
-* 'partial-cv':
-    crossvalidation with intensification,
-    requires ``"folds"`` in ``resampling_strategy_arguments``
-* ``BaseCrossValidator`` subclass:
-    any BaseCrossValidator subclass (found in scikit-learn model_selection module)
-* ``_RepeatedSplits`` subclass:
-    any _RepeatedSplits subclass (found in scikit-learn model_selection module)
-* ``BaseShuffleSplit`` subclass:
-    any BaseShuffleSplit subclass (found in scikit-learn model_selection module)

-resampling_strategy_arguments : dict, optional if 'holdout' (train_size default=0.67)
-    Additional arguments for resampling_strategy:
-
-    * ``train_size`` should be between 0.0 and 1.0 and represent the
-      proportion of the dataset to include in the train split.
-    * ``shuffle`` determines whether the data is shuffled prior to
-      splitting it into train and validation.
-
-    Available arguments:
-
-    * 'holdout': {'train_size': float}
-    * 'holdout-iterative-fit': {'train_size': float}
-    * 'cv': {'folds': int}
-    * 'cv-iterative-fit': {'folds': int}
-    * 'partial-cv': {'folds': int, 'shuffle': bool}
-    * BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments
-      required by chosen class as specified in scikit-learn documentation.
-      If arguments are not provided, scikit-learn defaults are used.
-      If no defaults are available, an exception is raised.
-      Refer to the 'n_splits' argument as 'folds'.
+resampling_strategy_arguments : Optional[Dict] = None
+Additional arguments for ``resampling_strategy``; these are required if
+using a ``"cv"`` based strategy. The default arguments if left as ``None``
+are:
+
+.. code-block:: python
+
+    {
+        "train_size": 0.67,  # The size of the training set
+        "shuffle": True,     # Whether to shuffle before splitting data
+        "folds": 5           # Used in 'cv' based resampling strategies
+    }
+
+If using a custom splitter class that takes ``n_splits``, such as
+`KFold <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold>`_,
+the value of ``"folds"`` will be used.
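
To make the two parameters concrete, here is a brief editor's sketch (not part of the diff; it assumes only the options documented above and toy settings):

.. code-block:: python

    # Sketch: holdout with a custom split proportion, and 5-fold CV.
    # Assumes autosklearn is installed; parameter names are as documented above.
    from autosklearn.classification import AutoSklearnClassifier

    # Holdout (the default strategy) with an 80:20 split instead of 67:33
    holdout_clf = AutoSklearnClassifier(
        resampling_strategy="holdout",
        resampling_strategy_arguments={"train_size": 0.8},
    )

    # Cross-validation; "folds" is required for "cv" based strategies
    cv_clf = AutoSklearnClassifier(
        resampling_strategy="cv",
        resampling_strategy_arguments={"folds": 5},
    )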

tmp_folder : string, optional (None)
folder to store configuration output and log files, if ``None``
@@ -219,12 +202,12 @@ def __init__(

n_jobs : int, optional, experimental
The number of jobs to run in parallel for ``fit()``. ``-1`` means
-using all processors.
-**Important notes**:
-* By default, Auto-sklearn uses one core.
-* Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
+using all processors.
+
+**Important notes**:
+
+* By default, Auto-sklearn uses one core.
+* Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
of models in the ensemble.
* ``predict()`` is not affected by ``n_jobs`` (in contrast to most scikit-learn models)
* If ``dask_client`` is ``None``, a new dask client is created.
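
As a quick illustration of these notes (an editor's sketch with assumed toy values, not from the diff):

.. code-block:: python

    # Sketch: fit() runs on four cores; predict() and ensemble building
    # are unaffected by n_jobs, as noted above.
    from autosklearn.classification import AutoSklearnClassifier

    clf = AutoSklearnClassifier(
        time_left_for_this_task=120,  # seconds for the whole search
        n_jobs=4,                     # parallel jobs for fit()
    )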
@@ -288,16 +271,14 @@ def __init__(

dataset_compression: Union[bool, Mapping[str, Any]] = True
We compress datasets so that they fit into some predefined amount of memory.
-Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
+Currently this does not apply to dataframes or sparse arrays, only to raw
+numpy arrays.

-**NOTE**
-
-If using a custom ``resampling_strategy`` that relies on specific
+**NOTE** - If using a custom ``resampling_strategy`` that relies on a specific
size or ordering of data, this must be disabled to preserve these properties.

-You can disable this entirely by passing ``False``.
-
-Default configuration when left as ``True``:
+You can disable this entirely by passing ``False``, or leave it as the default
+``True`` to use the configuration below.

.. code-block:: python

@@ -311,36 +292,36 @@

The available options are described here:

-**memory_allocation**
-
-By default, we attempt to fit the dataset into ``0.1 * memory_limit``. This
-float value can be set with ``"memory_allocation": 0.1``. We also allow for
-specifying absolute memory in MB, e.g. 10MB is ``"memory_allocation": 10``.
-
-The memory used by the dataset is checked after each reduction method is
-performed. If the dataset fits into the allocated memory, any further methods
-listed in ``"methods"`` will not be performed.
-
-For example, if ``methods: ["precision", "subsample"]`` and the
-``"precision"`` reduction step was enough to make the dataset fit into memory,
-then the ``"subsample"`` reduction step will not be performed.
-
-**methods**
-
-We currently provide the following methods for reducing the dataset size.
-These can be provided in a list and are performed in the order as given.
-
-* ``"precision"`` - We reduce floating point precision as follows:
-    * ``np.float128 -> np.float64``
-    * ``np.float96 -> np.float64``
-    * ``np.float64 -> np.float32``
-
-* ``subsample`` - We subsample data such that it **fits directly into the
-  memory allocation** ``memory_allocation * memory_limit``. Therefore, this
-  should likely be the last method listed in ``"methods"``.
-  Subsampling takes into account classification labels and stratifies
-  accordingly. We guarantee that at least one occurrence of each label is
-  included in the sampled set.
+* **memory_allocation**
+    By default, we attempt to fit the dataset into ``0.1 * memory_limit``.
+    This float value can be set with ``"memory_allocation": 0.1``.
+    We also allow for specifying absolute memory in MB, e.g. 10MB is
+    ``"memory_allocation": 10``.
+
+    The memory used by the dataset is checked after each reduction method is
+    performed. If the dataset fits into the allocated memory, any further
+    methods listed in ``"methods"`` will not be performed.
+
+    For example, if ``methods: ["precision", "subsample"]`` and the
+    ``"precision"`` reduction step was enough to make the dataset fit into
+    memory, then the ``"subsample"`` reduction step will not be performed.
+
+* **methods**
+    We provide the following methods for reducing the dataset size.
+    These can be provided in a list and are performed in the given order.
+
+    * ``"precision"`` - We reduce floating point precision as follows:
+        * ``np.float128 -> np.float64``
+        * ``np.float96 -> np.float64``
+        * ``np.float64 -> np.float32``
+
+    * ``"subsample"`` - We subsample data such that it **fits directly into
+      the memory allocation** ``memory_allocation * memory_limit``.
+      Therefore, this should likely be the last method listed in
+      ``"methods"``.
+      Subsampling takes into account classification labels and stratifies
+      accordingly. We guarantee that at least one occurrence of each
+      label is included in the sampled set.
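
Tying the two options together, a small editor's sketch (not part of the diff; the 128 MB cap and 3072 MB ``memory_limit`` are arbitrary example values):

.. code-block:: python

    # Sketch: cap the dataset at an absolute 128 MB; "precision" runs
    # first, and "subsample" only if the data still does not fit.
    from autosklearn.classification import AutoSklearnClassifier

    clf = AutoSklearnClassifier(
        memory_limit=3072,  # MB available to each job
        dataset_compression={
            "memory_allocation": 128,               # absolute MB, not a fraction
            "methods": ["precision", "subsample"],  # applied in this order
        },
    )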

Attributes
----------
2 changes: 1 addition & 1 deletion doc/conf.py
@@ -198,7 +198,7 @@
('Start', 'index'),
('Releases', 'releases'),
('Installation', 'installation'),
-#('Manual', 'manual'),
+('Manual', 'manual'),
('Examples', 'examples/index'),
('API', 'api'),
('Extending', 'extending'),
6 changes: 6 additions & 0 deletions doc/faq.rst
@@ -409,6 +409,12 @@ Configuring the Search Procedure

Examples for using holdout and cross-validation can be found in :ref:`example <sphx_glr_examples_40_advanced_example_resampling.py>`

+If using a custom resampling strategy with predefined splits, you may need to disable
+the subsampling performed on particularly large datasets or when using a small ``memory_limit``.
+Please see the manual section on :ref:`limits` and
+:class:`AutoSklearnClassifier(dataset_compression=...) <autosklearn.classification.AutoSklearnClassifier>`
+for more details.
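
For instance, a sketch (editor's illustration, assuming scikit-learn's ``PredefinedSplit`` and a toy fold assignment) that keeps the dataset's size and order intact by dropping ``"subsample"``:

.. code:: python

    import numpy as np
    from sklearn.model_selection import PredefinedSplit
    from autosklearn.classification import AutoSklearnClassifier

    # -1 marks samples that never appear in a test set
    test_fold = np.array([0, 0, 1, 1, -1, -1])

    clf = AutoSklearnClassifier(
        resampling_strategy=PredefinedSplit(test_fold=test_fold),
        dataset_compression={"methods": ["precision"]},  # no "subsample"
    )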

.. collapse:: <b>Can I use a custom metric</b>

Examples for using a custom metric can be found in :ref:`example <sphx_glr_examples_40_advanced_example_metrics.py>`
48 changes: 48 additions & 0 deletions doc/manual.rst
@@ -45,6 +45,54 @@ tested.

By default, *auto-sklearn* uses **one core**. See also :ref:`parallel` on how to configure this.


+.. collapse:: <b>Managing data compression</b>
+
+    .. _manual_managing_data_compression:
+
+    Auto-sklearn will attempt to fit the dataset into 1/10th of the ``memory_limit``.
+    This won't happen unless your dataset is quite large or you have a small
+    ``memory_limit``. This is done using two methods, reducing **precision** and
+    **subsampling**. One reason you may want to control this is if you require
+    high precision or rely on predefined splits, which subsampling does not
+    account for.
+
+    To turn off dataset compression:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression = False
+        )
+
+    You can specify which of the methods are performed using:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression = { "methods": ["precision", "subsample"] },
+        )
+
+    You can change the memory allocation for the dataset to a percentage of
+    ``memory_limit`` or an absolute amount using:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression = { "memory_allocation": 0.2 },
+        )
+
+    The default arguments used when ``dataset_compression = True`` are:
+
+    .. code:: python
+
+        {
+            "memory_allocation": 0.1,
+            "methods": ["precision", "subsample"]
+        }
+
+    The full description is given at :class:`AutoSklearnClassifier(dataset_compression=...) <autosklearn.classification.AutoSklearnClassifier>`.
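
The keys can also be combined; a short sketch under the defaults above (editor's illustration; as in the snippets above, the import of ``AutoSklearnClassifier`` is assumed):

.. code:: python

    AutoSklearnClassifier(
        dataset_compression = {
            "memory_allocation": 0.2,   # 20% of memory_limit
            "methods": ["precision"]    # never subsample
        },
    )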

.. _space:

The search space