add developer API post #198

Open
wants to merge 2 commits into
base: main
142 changes: 142 additions & 0 deletions _posts/2024-12-05-dev-api.md
@@ -0,0 +1,142 @@
---
#### Blog Post Template ####

#### Post Information ####
title: "Changes and development of scikit-learn's developer API"
date: December 12, 2024

#### Post Category and Tags ####
# Format in titlecase without dashes (Ex. "Open Source" instead of "open-source")
categories:
- Updates
tags:
- Open Source
- Machine Learning
- License

#### Featured Image ####
featured-image: BSD_watermark.svg

#### Author Info ####
# Can accommodate multiple authors
# Add SQUARE Author Image to /assets/images/author_images/ folder
postauthors:
  - name: Adrin Jalali
    website: https://adrin.info/
    image: adrin-jalali.jpeg
---
<div>
<img src="/assets/images/posts_images/{{ page.featured-image }}" alt="">
{% include postauthor.html %}
</div>

Historically, scikit-learn's API has been divided into public and private parts. The
public API is intended for users, while the private API is used internally in
scikit-learn to develop new features and estimators. However, many of those private
utilities have become essential to third parties who develop scikit-learn estimators
outside the scikit-learn codebase.

When it comes to our public API, we have very strict and high standards on backward
compatibility. The rule of thumb is that no change should force a change in users'
code unless we have warned about it for two release cycles, which means we give users
a year to update their code.

On the other hand, we have no such guarantees or constraints on our private API. This
is a problem for third-party developers who would like to use the same tools that
scikit-learn developers use to build their estimators: a private API that changes
constantly and without prior warning is hard to build on.

As a result, we've been working on creating a developer API which would sit somewhere
between our public and private API in terms of backward compatibility. That means we
intend to try to keep that API stable, and if needed, introduce changes with one release
cycle warning.

In the past few releases, we've slowly introduced more functionalities under this
umbrella. `__sklearn_clone__` and `__sklearn_is_fitted__` are two examples.

Review comment (Member): Is there an easy way to tell what is part of this developer API? If yes we could mention it here. If no, we could create it at some point later (not needed for this blog post)


In the latest release, at the time of writing this post, we focused on the testing

Review comment (Member): (not sure why I can't suggest a change directly) Should we just say "In the 1.6 release, we focussed on ..."?

infrastructure and estimator tag system. Estimator tags used to be private, and we
were not sure about their design. In the 1.6 release, new tags are introduced and
using them looks like the following:

```python
from sklearn.base import BaseEstimator, ClassifierMixin

class MyEstimator(ClassifierMixin, BaseEstimator):

    ...

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        # modify tags here
        tags.non_deterministic = True
        return tags
```

The new tags mostly follow the same structure as the old tags, but there are certain
changes to them. The main change is that the old `_xfail_checks` is no longer present

Review comment (Member): "is no more present" -> "is no longer present"

in the new tags. That tag was used to tell the common testing tools about the tests
which are known to fail and are to be skipped. That information is now passed directly
to the testing functions. The old way of skipping a test was the following:

```python
from sklearn.base import BaseEstimator, ClassifierMixin

class MyEstimator(ClassifierMixin, BaseEstimator):

    ...

    def _more_tags(self):
        return {
            "_xfail_checks": {
                "check_to_skip_name": "this check is known to fail",
                # ...
            }
        }
```

Then, calling `check_estimator` or using `parametrize_with_checks` with `pytest`
would automatically ignore those tests for the estimator.

Instead, in this release, you pass that information directly to those functions:

```python
from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks

# MyEstimator is the estimator class defined in the earlier snippets.
CHECKS_EXPECTED_TO_FAIL = {
    "check_to_skip_name": "this check is known to fail",
    # ...
}

# Using check_estimator
def test_with_check_estimator():
    check_estimator(MyEstimator(), expected_failed_checks=CHECKS_EXPECTED_TO_FAIL)

# Using parametrize_with_checks: here the argument is a callable that
# receives the estimator and returns the dict of expected failures.
@parametrize_with_checks(
    [MyEstimator()],
    expected_failed_checks=lambda est: CHECKS_EXPECTED_TO_FAIL
)
def test_with_parametrize_with_checks(estimator, check):
    check(estimator)
```

While working on the testing infrastructure, we have also been improving our tests,
which means that in this release we had a particularly high number of changes in

Review comment (Member), suggested change:
- tests and that means in this release we had a particularly higher number of changes in
+ tests and that means in this release we had a particularly high number of changes in

their names and what they do. The changes should make it easier for developers to

Review comment (Member, @betatim, Dec 16, 2024): "The changes should have" -> "The changes will make it easier for"

"should have" somehow sounds like it was a failed attempt. "Increasing prices should have reduced demand for ice cream, however queues are longer than ever"

fix issues with their estimators. Note that you can now pass `legacy=False` to both
`check_estimator` and `parametrize_with_checks` to include only strictly API-related
tests.
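
A minimal sketch of the `legacy=False` option, assuming scikit-learn 1.6 or later. The
wrapper name `run_api_checks_only` is hypothetical and only keeps the call in one
place.

```python
from sklearn.utils.estimator_checks import check_estimator


def run_api_checks_only(estimator):
    """Run the common checks, restricted to the strictly API-related ones."""
    # legacy=False leaves out the legacy checks and keeps only the
    # API-related subset described above.
    return check_estimator(estimator, legacy=False)
```

It could be called on any estimator instance, for example
`run_api_checks_only(MyEstimator())`.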

The above changes mean developers need to update their estimators and, depending on
what they use, write scikit-learn version-specific code to support multiple
scikit-learn versions. To make that process easier, we've worked on a package called
[`sklearn_compat`](https://github.com/sklearn-compat/sklearn-compat/). You can either
depend on it as a package dependency, or vendor a single file inside your project. At
the moment this project is in its infancy and might change in the future. But hopefully
it helps developers out there.
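
To give an idea of what hand-written, version-specific code can look like, here is a
minimal sketch. It is not taken from `sklearn_compat` and does not describe its API;
the helper names `sklearn_at_least` and `check_kwargs` are hypothetical. It only
decides, from a version string, whether `expected_failed_checks` should be passed to
`check_estimator`.

```python
import re


def sklearn_at_least(version: str, minimum: tuple[int, int]) -> bool:
    """Return True if a version string such as "1.6.0" or "1.7.dev0" is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_kwargs(version: str, expected_failures: dict) -> dict:
    """Build extra keyword arguments for check_estimator.

    From 1.6 on, known failures are passed as `expected_failed_checks`.
    Before that they were declared through the `_xfail_checks` tag on the
    estimator, so nothing extra is passed.
    """
    if sklearn_at_least(version, (1, 6)):
        return {"expected_failed_checks": expected_failures}
    return {}
```

A test could then call
`check_estimator(MyEstimator(), **check_kwargs(sklearn.__version__, CHECKS_EXPECTED_TO_FAIL))`.
For `parametrize_with_checks` the same idea applies, except that the value has to be a
callable returning the dict.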

If you think there are missing functionalities in the developer API, please let us know
and give us feedback on our
[issue tracker](https://github.com/scikit-learn/scikit-learn/issues).