[FEATURE] Wrapping any estimator for caching at fit, predict, and transform time. #706
Comments
You can see the code here: https://github.com/antngh/sklearn-estimator-caching
Hey @antngh, thanks for the issue and for already putting the effort into this. I took a sneak peek at the repo, and just by its volume it could deserve to be its own project/repo. I can imagine people having multiple use cases for such a caching mechanism, and therefore different feature requests for it. I will wait for @koaning to weigh in on this as well.
I have also observed pipelines becoming slower with caching on the sklearn side. If the numpy array going in is huge, the hashing might actually be slower than the pipeline. It doesn't happen all the time, but it is worth keeping in the back of your mind.

I wonder: if the final element of a pipeline is skipped, why not add a FunctionTransformer at the end? With no arguments it behaves like an identity function, but it will act as the "final" transformer. Does that not work with the memory flag in a normal pipeline?

Another sensible way to cache an intermediate output is to manually store a transformed array in memory or to write it to disk from there. This only works if you know exactly what needs to be remembered and if it does not change, but it might be easier to reason about than the hashing involved in a caching mechanism.

I am personally a little hesitant to support it here because I am a bit wary of all the edge cases. But if you have a compelling benchmark, I'd sure be all ears!
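For reference, here is a minimal sketch of the FunctionTransformer idea mentioned above (the `SlowTransformer` class and the sleep are illustrative assumptions, not part of any library): appending a no-argument FunctionTransformer, which is an identity step, means the slow transformer is no longer the final step, so the Pipeline's `memory` caching applies to it at fit time.

```python
from tempfile import mkdtemp
import time

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class SlowTransformer(TransformerMixin, BaseEstimator):
    """Illustrative stand-in for an expensive transformation step."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        time.sleep(2)  # simulate slow work
        return X


pipe = Pipeline(
    steps=[
        ("slow", SlowTransformer()),
        ("identity", FunctionTransformer()),  # identity step, so "slow" is no longer final
    ],
    memory=mkdtemp(),  # joblib-based caching of the non-final steps
)

X = np.random.rand(1000, 3)
pipe.fit(X)        # slow on the first call
pipe.fit(X)        # "slow" is loaded from the cache instead of being refit
pipe.transform(X)  # inference is still not cached, so this stays slow
```

Note that this only helps at fit time; calling transform or predict on the fitted pipeline still reruns every step, which is the gap the wrapper in this issue targets.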
Thanks both. Some timing here: https://github.com/antngh/sklearn-estimator-caching/blob/main/notebooks/caching_timing.ipynb

The overhead for the inference calls is about the same as for the fit calls (Pipeline has no equivalent caching here). I first created this code because of some very slow custom transformers I was working with. In my case it wasn't a matter of a normal transformer with a huge dataset, but rather a big dataset with a super slow transformation step, and there I see a huge improvement when using this wrapper. You're right that we could manually save/load the data, but that quickly becomes hard to track and manage.
I fully understand.
Please let me know if this is not the correct place or way to start this discussion.
I have some code for a wrapper around an estimator (transformer or predictor) that quickly saves the object and the data to disk. If the wrapped estimator is called with an identical instance (same properties, etc.) and with the same input data, it fetches the result from disk rather than rerunning the corresponding fit/predict/transform code. The wrapped estimator behaves exactly as an estimator would in all the cases I've tested.
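For illustration only, here is a minimal sketch of the general idea; this is not the code in the linked repository, and the `CachedEstimator` name, the cache layout, and the joblib-hash key scheme are all assumptions.

```python
import os

import joblib
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class CachedEstimator(MetaEstimatorMixin, BaseEstimator):
    """Wrap an estimator and reuse fit/transform/predict results stored on disk."""

    def __init__(self, estimator, cache_dir="estimator_cache"):
        self.estimator = estimator
        self.cache_dir = cache_dir

    def _path(self, method, *parts):
        # Key on the estimator's class and parameters plus the data relevant
        # to this call, so changing either forces a recompute.
        key = joblib.hash(
            (type(self.estimator).__name__, self.estimator.get_params(deep=True), method, parts)
        )
        return os.path.join(self.cache_dir, f"{method}-{key}.joblib")

    def fit(self, X, y=None):
        os.makedirs(self.cache_dir, exist_ok=True)
        path = self._path("fit", X, y)
        if os.path.exists(path):
            self.estimator_ = joblib.load(path)  # reuse the previously fitted estimator
        else:
            self.estimator_ = clone(self.estimator).fit(X, y)
            joblib.dump(self.estimator_, path)
        return self

    def transform(self, X):
        # Include the fitted estimator in the key: identical params fitted on
        # different data must not share a cache entry.
        path = self._path("transform", self.estimator_, X)
        if os.path.exists(path):
            return joblib.load(path)
        out = self.estimator_.transform(X)
        joblib.dump(out, path)
        return out

    def predict(self, X):
        path = self._path("predict", self.estimator_, X)
        if os.path.exists(path):
            return joblib.load(path)
        out = self.estimator_.predict(X)
        joblib.dump(out, path)
        return out
```

Usage would look like `CachedEstimator(some_slow_transformer).fit(X).transform(X)`; a second identical call, even in a new session, loads the stored result instead of recomputing it.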
Sklearn provides something similar with the `memory` arg in the Pipeline class, but it doesn't extend to inference, only fitting, and even then it won't apply to the last transformer in the pipeline.

This is especially useful when there is an estimator with a slow-running predict/transform step and you want to run the pipeline quickly. It will run again if needed (if either the estimator or the input data has changed), but otherwise it will just load from file. This is also maintained across runs: if you restart the kernel or run the script again, you can pick up where you left off. It isn't intended as a data store, but it can really speed up the development of pipelines with slow steps.
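To make the `memory` limitation described above concrete, here is a minimal sketch (the `SlowStep` class and the sleeps are illustrative assumptions): with a single-step pipeline the slow step is the final one, so `memory` never caches its fit, and inference calls are never cached in any case.

```python
from tempfile import mkdtemp
import time

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SlowStep(TransformerMixin, BaseEstimator):
    """Illustrative transformer whose fit and transform are both expensive."""

    def fit(self, X, y=None):
        time.sleep(2)  # simulate slow fitting
        return self

    def transform(self, X):
        time.sleep(2)  # simulate slow transforming
        return X


pipe = Pipeline([("slow", SlowStep())], memory=mkdtemp())

X = np.random.rand(1000, 3)
pipe.fit(X)        # not cached: "slow" is the final step of the pipeline
pipe.fit(X)        # fits again from scratch
pipe.transform(X)  # inference is never cached, whatever the step's position
```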
Please let me know if this functionality, in full or in part (say, the code that checks whether the data is the same), could be useful, and I will look at adding it to this repo. I can't commit to fully maintaining it going forward, but as of now it seems to work well.