
Add as_sklearn and from_sklearn APIs to serialize to CPU sklearn-estimators for supported models #6102

Merged
merged 15 commits into rapidsai:branch-25.02 on Jan 14, 2025

Conversation

@dantegd (Member) commented Oct 8, 2024

No description provided.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Oct 8, 2024
@betatim (Member) commented Nov 19, 2024

Why have the methods do both the conversion cuml<>sklearn and the serialisation? Having a way to convert to and from scikit-learn seems like a useful thing by itself. Maybe because you have your own way of serialising the model, or because you need a particular type of model or who-knows-what.

So to serialise it you'd do something like pickle.dumps(cuml_est.to_sklearn()) (or dill, joblib, ...)

How hard would it be to have cuml.from_sklearn(estimator)? As in one top level function that takes a scikit-learn estimator and converts it to the cuml equivalent? It seems like it should be easy to figure out the estimator's class name: "just look at .__class__.__name__" but I wonder if there is a trap here?
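A rough sketch (not cuML's actual implementation) of what a top-level `cuml.from_sklearn` could look like: a registry keyed on the scikit-learn class name, populated by a decorator. All class names below are toy stand-ins. The "trap" with `estimator.__class__.__name__` is that user-defined subclasses of an sklearn estimator would not be found by a plain name lookup.

```python
# Hypothetical sketch of name-based dispatch for a top-level from_sklearn().
_SKLEARN_TO_CUML = {}

def _register(sklearn_name):
    """Map an sklearn class name to its accelerated equivalent."""
    def wrap(cls):
        _SKLEARN_TO_CUML[sklearn_name] = cls
        return cls
    return wrap

@_register("LinearRegression")
class ToyCumlLinearRegression:
    @classmethod
    def from_sklearn(cls, model):
        est = cls()
        est._cpu_model = model  # keep the CPU model around
        return est

def from_sklearn(estimator):
    """Top-level dispatch: look the class up by name, then delegate."""
    name = type(estimator).__name__
    try:
        target = _SKLEARN_TO_CUML[name]
    except KeyError:
        raise TypeError(f"No accelerated equivalent registered for {name!r}")
    return target.from_sklearn(estimator)

class LinearRegression:  # stand-in for sklearn.linear_model.LinearRegression
    pass

converted = from_sklearn(LinearRegression())
```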


Name bike shedding: if we don't save things to a file, how about as_sklearn? A bit like other functions that do type conversion like astype.

If we do save to a file, then save_sklearn and load_sklearn? Basically getting words like "save" and "load" in there to make it clear that this is about storing things (to a file).

copy-pr-bot commented Dec 19, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@dantegd dantegd changed the title Add to_sklearn and from_sklearn APIs to serialize to CPU sklearn-estimators for supported models Add as_sklearn and from_sklearn APIs to serialize to CPU sklearn-estimators for supported models Dec 19, 2024
@dantegd dantegd changed the base branch from branch-24.12 to branch-25.02 December 19, 2024 23:57
@dantegd (Member, Author) commented Dec 20, 2024

@betatim I just implemented your suggestions and changed the functionality to return an estimator instead of saving to a file.

I think the idea of cuml.from_sklearn is fantastic, but it requires some additional logic and testing, since we would also have to validate that the requested estimator has the needed functionality. Perhaps it can be a follow-up, to keep this PR small and succinct?

@viclafargue (Contributor) left a comment

LGTM, just have two comments

```python
        return self._cpu_model

    @classmethod
    def from_sklearn(cls, model):
```
@viclafargue (Contributor) commented Dec 20, 2024:

Would be great to have a global conversion table, so that we don't need to provide the class as a parameter.

@dantegd (Member, Author) replied:

This is a class method, so we get the class from cls; it's not something the user passes (like self in non-class methods).

A global conversion table would be useful in a follow-up that adds library-level cuml.from_sklearn functionality, though.
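A minimal illustration of the point above: in a classmethod, the class arrives implicitly as `cls`, so callers never pass it in. The names below are stand-ins, not cuML's real class hierarchy.

```python
# Hypothetical sketch of the classmethod constructor pattern.
class ToyBase:
    @classmethod
    def from_sklearn(cls, model):
        est = cls()            # builds whichever subclass this was called on
        est._cpu_model = model
        return est

class ToyLinearRegression(ToyBase):
    pass

# The concrete class is supplied by the call site, not by a parameter.
est = ToyLinearRegression.from_sklearn(model="cpu model")
```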

"""
estimator = cls()
estimator.import_cpu_model()
estimator._cpu_model = model
@viclafargue (Contributor) commented:

It could be interesting to add an optional parameter to this function to allow a deepcopy of the sklearn model.

@dantegd (Member, Author) replied:

lol, I asked the same thing before reading this suggestion :)

A contributor commented:
Rather than adding parameters to control deep-copying, why not implement explicit (deep-/)copying logic on the class and thus enable the user to perform the copies when desired? That would follow "Explicit is better than implicit" and "There should be one-- and preferably only one --obvious way to do it."

Example:

```python
from copy import deepcopy as dc

accel_est = dc(cpu_estimator.from_sklearn())
```

Supporting copies explicitly in a controlled manner would be a good idea anyways, no?

@betatim (Member) replied:

I like Simon's suggestion. I think having a parameter is not so useful if the above code snippet works; then those who know they need to deepcopy can do that.

```python
        self.import_cpu_model()
        self.build_cpu_model()
        self.gpu_to_cpu()
        return self._cpu_model
```
A contributor asked:

Should we not return a deep copy here?

A member replied:

For my education, why would we want to deepcopy? Mostly asking because, in my experience, in 99% of cases where someone uses deepcopy there is something else we can do instead, or we can just not do it. Mostly, Python "just works" without deepcopying, hence my interest.

A contributor replied:

If we simply return a reference to the internal model, any modification to one (additional training or something else) would affect the other. This might create a situation in which the CPU and GPU attributes of the cuML estimator are out of sync. Or, inversely, the sklearn estimator returned by the function might silently be updated by the cuML estimator.
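A toy demonstration of the aliasing concern described above: returning the internal model by reference means mutating one object mutates the other, while `copy.deepcopy` severs that link. `ToyModel` and `ToyEstimator` are illustrative stand-ins, not cuML classes.

```python
from copy import deepcopy

class ToyModel:
    def __init__(self):
        self.coef_ = [1.0, 2.0]

class ToyEstimator:
    def __init__(self):
        self._cpu_model = ToyModel()

    def as_sklearn(self):
        return self._cpu_model            # shared reference

    def as_sklearn_deep(self):
        return deepcopy(self._cpu_model)  # independent snapshot

shared_est = ToyEstimator()
shared = shared_est.as_sklearn()
shared.coef_[0] = 99.0      # the internal model silently changes too

copied_est = ToyEstimator()
snapshot = copied_est.as_sklearn_deep()
snapshot.coef_[0] = -1.0    # the internal model stays untouched
```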

@dantegd (Member, Author) replied:

Perhaps it should be a parameter. By default most users probably won't care about needing a deep copy, so I wouldn't do it by default, but if a user needs it then they can request it. What do you all think?

A contributor replied:

I'd advise against adding such a parameter and would instead expect users to explicitly create a deep copy when desired. See also my other comment on this.

Comment on lines 70 to 71
```python
    else:
        raise ValueError(f"Serializer {format} not supported.")
```
A member commented:

Does click not take care of invalid values being passed? :(

@dantegd (Member, Author) replied:

It does; I added it by accident based on habit :P
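For reference, a CLI layer with an enumerated choice rejects invalid values before any manual check runs, which is why the explicit ValueError above is redundant. This sketch uses stdlib argparse's `choices=` (click's `click.Choice` serves the same purpose); the option name and format values are made up for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--format", choices=["pickle", "joblib"], default="pickle")

args = parser.parse_args(["--format", "joblib"])   # accepted
# parser.parse_args(["--format", "yaml"]) would print an error and exit,
# so a manual `raise ValueError` for unknown formats is never reached.
```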

Comment on lines 74 to 81
```python
    # Convert to sklearn estimator
    sklearn_estimator = accelerated_estimator.as_sklearn()

    # Save using chosen format
    with open(output, "wb") as f:
        serializer.dump(sklearn_estimator, f)

    # Exit after conversion
```
A member commented:

Do we need the comments? They seem to repeat what the code says. I like comments that explain why the code is the way it is, but I don't think we need that here, as it is pretty straightforward.

Comment on lines +54 to +56
```python
@pytest.fixture
def random_state():
    return 42
```
A member commented:

Do we really need this instead of using 42 directly in the tests?

We could have a global version of this that allows us to run the tests with several seeds, but maybe that's something to tackle in a future/new PR.
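The "global version with several seeds" idea could look like a parametrized fixture. A hypothetical sketch (seed values and test name are arbitrary):

```python
import pytest

@pytest.fixture(params=[0, 42, 1234])
def random_state(request):
    # every test requesting this fixture runs once per seed
    return request.param

def test_uses_seed(random_state):
    assert random_state in (0, 42, 1234)
```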

@dantegd (Member, Author) replied:

We don't really need it at all.

Co-authored-by: Tim Head <betatim@gmail.com>
@dantegd dantegd added feature request New feature or request non-breaking Non-breaking change labels Dec 20, 2024
@dantegd dantegd marked this pull request as ready for review December 20, 2024 19:35
@dantegd dantegd requested a review from a team as a code owner December 20, 2024 19:35
@dantegd dantegd requested review from csadorf and wphicks December 20, 2024 19:35
@wphicks (Contributor) left a comment:

Pre-approving this given vacation schedules. Looks great to me once current discussion is resolved.

@viclafargue (Contributor) left a comment:

LGTM, adding the deepcopy optional parameter to both functions is probably optimal, but both solutions work for me.

@betatim (Member) left a comment:

(Still) LGTM, modulo removing that "not at all needed" random state :D

@dantegd (Member, Author) commented Jan 14, 2025

/merge

@rapids-bot rapids-bot bot merged commit cf259f6 into rapidsai:branch-25.02 Jan 14, 2025
62 checks passed