SLEP010 n_features_in_ attribute #22
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| .. _slep_010: | ||
|
|
||
| ===================================== | ||
| SLEP010: ``n_features_in_`` attribute | ||
| ===================================== | ||
|
|
||
| :Author: Nicolas Hug | ||
| :Status: Under review | ||
| :Type: Standards Track | ||
| :Created: 2019-11-23 | ||
|
|
||
| Abstract | ||
| ######## | ||
|
|
||
| This SLEP proposes the introduction of a public ``n_features_in_`` attribute | ||
| for most estimators (where relevant). This attribute is automatically set | ||
| when calling a new method ``BaseEstimator._validate_data(X, y=None)`` which | ||
| is meant to replace ``check_array`` and ``check_X_y`` in most cases, calling | ||
| those under the hood. | ||
|
|
||
| Motivation | ||
| ########## | ||
|
|
||
| Knowing the number of features that an estimator expects is useful for | ||
| inspection purposes, as well as for input validation. | ||
|
|
||
| Solution | ||
| ######## | ||
|
|
||
| The proposed solution is to replace most calls to ``check_array()`` or | ||
| ``check_X_y()`` by calls to a newly created private method:: | ||
|
Member
When we say "private", do we mean that we do not authorise third-party libraries to rely on this API?
Member
Author
Yes. I added a note. |
||
|
|
||
|     def _validate_data(self, X, y=None, reset=True, **check_array_params): | ||
|         ... | ||
|
|
||
| The ``_validate_data()`` method will call either ``check_array()`` or | ||
| ``check_X_y()``, depending on the ``y`` parameter. | ||
|
|
||
| If the ``reset`` parameter is True (default), the method will set the | ||
| ``n_features_in_`` attribute of the estimator, regardless of its potential | ||
| previous value. This should typically be used in ``fit()``, or in the first | ||
| ``partial_fit()`` call. Passing ``reset=False`` will not set the attribute but | ||
| instead check the input against it, and raise an error if the number of | ||
| features differs. This should typically be used in ``predict()`` or | ||
| ``transform()``, or on subsequent calls to ``partial_fit()``. | ||
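
The ``reset`` semantics described above can be sketched with a toy estimator. This is a minimal illustration only, not the scikit-learn implementation: ``SketchEstimator`` is a hypothetical class, the validation on plain lists is an assumed stand-in for ``check_array()`` / ``check_X_y()``, and the extra ``**check_array_params`` are omitted.

```python
class SketchEstimator:
    """Hypothetical toy estimator illustrating the proposed ``reset`` flag."""

    def _validate_data(self, X, y=None, reset=True):
        # Assumed stand-in for check_array / check_X_y:
        # require a non-empty, rectangular 2D input.
        X = [list(row) for row in X]
        if not X or any(len(row) != len(X[0]) for row in X):
            raise ValueError("X must be a non-empty rectangular 2D input")
        if y is not None and len(y) != len(X):
            raise ValueError("X and y have inconsistent numbers of samples")

        n_features = len(X[0])
        if reset:
            # fit() or first partial_fit(): overwrite any previous value.
            self.n_features_in_ = n_features
        else:
            # predict(), transform(), later partial_fit(): check only.
            expected = getattr(self, "n_features_in_", None)
            if expected is not None and n_features != expected:
                raise ValueError(
                    f"X has {n_features} features, but this estimator "
                    f"expects {expected} features"
                )
        return X if y is None else (X, list(y))

    def fit(self, X, y):
        X, y = self._validate_data(X, y, reset=True)
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        X = self._validate_data(X, reset=False)
        return [self.mean_ for _ in X]
```

In this sketch, calling ``fit`` again with a different number of columns silently replaces ``n_features_in_``, while ``predict`` with a mismatched input raises ``ValueError``.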
|
|
||
| In most cases, the ``n_features_in_`` attribute exists only once ``fit`` has | ||
| been called, but there are exceptions (see below). | ||
|
|
||
| A new common check is added: it makes sure that for most estimators, the | ||
| ``n_features_in_`` attribute does not exist until ``fit`` is called, and | ||
| that its value is correct. Instead of raising an exception, this check will | ||
| raise a warning for the next two releases. This will give downstream | ||
| packages some time to adjust (see considerations below). | ||
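
A warning-based common check of this kind could look roughly as follows. This is a sketch under stated assumptions: the function name ``check_n_features_in``, the ``FutureWarning`` category, and the two toy estimator classes are all hypothetical and only illustrate "warn instead of fail".

```python
import warnings


def check_n_features_in(estimator, X, y):
    """Hypothetical common check that warns instead of raising."""
    # The attribute should not exist on an unfitted estimator
    # (for most estimators; the SLEP lists exceptions).
    if hasattr(estimator, "n_features_in_"):
        warnings.warn(
            "n_features_in_ should not exist before fit is called",
            FutureWarning,
        )
    estimator.fit(X, y)
    # After fit, the attribute should match the number of columns of X.
    expected = len(X[0])
    actual = getattr(estimator, "n_features_in_", None)
    if actual != expected:
        warnings.warn(
            f"n_features_in_ is {actual!r} after fit, expected {expected}",
            FutureWarning,
        )


class Compliant:
    """Toy estimator that sets the attribute in fit."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self


class Legacy:
    """Toy downstream estimator that has not been updated yet."""

    def fit(self, X, y=None):
        return self
```

With this design a not-yet-updated downstream estimator such as ``Legacy`` still passes the check, but its maintainers see a warning during the transition period.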
|
|
||
| The logic that is proposed here (calling a stateful method instead of a | ||
| stateless function) is a pre-requisite to fixing the dataframe column | ||
| ordering issue: with a stateless ``check_array``, there is no way to raise | ||
| an error if the column ordering of a dataframe was changed between ``fit`` | ||
| and ``predict``. | ||
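
To illustrate why state is needed, here is a hedged sketch of how a stateful validation method could later detect a column reordering. It is not part of this proposal: the ``columns_in_`` attribute is a hypothetical name, and a dict mapping column names to value lists is used as an assumed stand-in for a dataframe.

```python
class ColumnAwareEstimator:
    """Hypothetical estimator that remembers column order seen in fit."""

    def _validate_data(self, X, reset=True):
        # X: dict mapping column name -> list of values (dataframe stand-in).
        columns = list(X)
        if reset:
            # Record what was seen during fit.
            self.columns_in_ = columns  # hypothetical attribute name
            self.n_features_in_ = len(columns)
        elif columns != self.columns_in_:
            # Only possible because the estimator kept state from fit.
            raise ValueError(
                f"Columns {columns} do not match those seen in fit "
                f"{self.columns_in_}"
            )
        return X

    def fit(self, X):
        self._validate_data(X, reset=True)
        return self

    def predict(self, X):
        X = self._validate_data(X, reset=False)
        n_samples = len(next(iter(X.values())))
        return [0] * n_samples
```

A stateless function receives only the current input, so it has nothing to compare the column order against.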
|
|
||
| Considerations | ||
| ############## | ||
|
|
||
| The main consideration is that the addition of the common test means that | ||
| existing estimators in downstream libraries will not pass our test suite, | ||
| unless the estimators also have the ``n_features_in_`` attribute (which can be | ||
| done by updating calls to ``check_XXX()`` into calls to ``_validate_data()``). | ||
|
|
||
| The newly introduced checks will only raise a warning instead of an exception | ||
| for the next 2 releases, so this will give more time for downstream packages | ||
| to adjust. | ||
|
|
||
| Note that we have never guaranteed any kind of backward compatibility | ||
| regarding the test suite: see e.g. `#12328 | ||
| <https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `#14680 | ||
| <https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `#9270 | ||
| <https://github.com/scikit-learn/scikit-learn/pull/9270>`_, which all add new | ||
| checks. | ||
|
|
||
| There are other minor considerations: | ||
|
|
||
| - In most meta-estimators, the input validation is handled by the | ||
| sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator | ||
| is thus explicitly set to that of the sub-estimator, either via a | ||
| ``@property``, or directly in ``fit()``. | ||
| - Some estimators like the dummy estimators do not validate the input | ||
| (the 'no_validation' tag should be True). The ``n_features_in_`` attribute | ||
| should be set to None, though this is not enforced in the common tests. | ||
| - Some estimators expect a non-rectangular input: the vectorizers. These | ||
| estimators expect dicts or lists, not an ``n_samples * n_features`` matrix. | ||
| ``n_features_in_`` makes no sense here and these estimators just don't have | ||
| the attribute. | ||
| - Some estimators may know the number of input features before ``fit`` is | ||
| called: typically the ``SparseCoder``, where ``n_features_in_`` is known at | ||
| ``__init__`` from the ``dictionary`` parameter. In this case the attribute | ||
| is a property and is available right after object instantiation. | ||
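
Two of the cases in the list above, delegation in a meta-estimator and an attribute known at construction time, can be sketched with properties. All three classes are hypothetical toys, not scikit-learn code; the ``(n_components, n_features)`` layout of ``dictionary`` is assumed to mirror ``SparseCoder``.

```python
class ToyBase:
    """Hypothetical sub-estimator that validates its own input."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self


class ToyMeta:
    """Hypothetical meta-estimator delegating to its sub-estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = self.estimator.fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Raises AttributeError before fit, because estimator_ does not
        # exist yet, so hasattr(meta, "n_features_in_") is False.
        return self.estimator_.n_features_in_


class ToyCoder:
    """Hypothetical estimator whose input width is known at __init__."""

    def __init__(self, dictionary):
        # Assumed layout: one row per component, one column per feature.
        self.dictionary = dictionary

    @property
    def n_features_in_(self):
        # Available right after instantiation, without calling fit.
        return len(self.dictionary[0])
```

Using a property in ``ToyMeta`` keeps the value in sync with the fitted sub-estimator and avoids storing a second copy that could go stale.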
|
|
||
| References and Footnotes | ||
| ------------------------ | ||
|
|
||
| .. [1] Each SLEP must either be explicitly labeled as placed in the public | ||
| domain (see this SLEP as an example) or licensed under the `Open | ||
| Publication License`_. | ||
|
|
||
| .. _Open Publication License: https://www.opencontent.org/openpub/ | ||
|
|
||
|
|
||
| Copyright | ||
| --------- | ||
|
|
||
| This document has been placed in the public domain. [1]_ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,7 @@ | ||
| SLEPs under review | ||
| ================== | ||
|
|
||
| Nothing here | ||
| .. toctree:: | ||
| :maxdepth: 1 | ||
|
|
||
| slep010/proposal |