Cache model loading in model card #243
Comments
Hi! I would like to help here. If we were to use the md5 sum of the file that the path points to, then we would have to calculate the md5 sum every time the function is called, right?
Each time the function is called with a str or Path argument, yes. I would assume that calculating the md5 should be really fast, but ideally that should be measured before we change the code.
Actually, it's better to use the md5 sum of the model file rather than loading the model file every time anyway. I could start a PR on this issue.
Well, if loading the file takes 100 ms and calculating the md5 takes 99 ms, arguably it's better to just always load the file. That's why we should measure, just to be sure. Instead of md5, we could also consider storing and checking the time the file was last modified.
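As an illustration of that last-modified-time idea, here is a minimal sketch; the cache dictionary, the function name, and the `loader` callable are hypothetical and not taken from the skops codebase:

```python
import os
from pathlib import Path

# Hypothetical module-level cache mapping a file path to (mtime, loaded model).
_MODEL_CACHE = {}

def load_model_if_changed(path, loader):
    """Reload the model only when the file's last-modified time has changed."""
    key = str(Path(path))
    mtime = os.path.getmtime(key)
    cached = _MODEL_CACHE.get(key)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    model = loader(key)  # e.g. joblib.load or pickle.load
    _MODEL_CACHE[key] = (mtime, model)
    return model
```

Note that mtime detects changes by timestamp rather than by content, so it is cheaper than hashing but may occasionally reload unchanged content, which fits the goal discussed later of avoiding most rather than all redundant reloads.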
Okay, I understand. Maybe we could use some metadata of the model file to check whether it has changed or not.
The idea is not to "never reload a file twice"; the idea is to "not reload the same content in most cases". So if a solution avoids reloading the same file in 95% of cases, that's a good enough solution, especially if it keeps things simple.
Thanks for clarifying. I think I could do some testing on the loading times and the md5 hash of a model file, and give some feedback on how they compare?
Sure.
Sorry I haven't posted a response to what I said I would test. I will give a response in the next 2-3 days :)
Hey! I have some results to share with you. I have tested the loading times and the time taken to compute the md5sum of the pickled model file with some examples, using two model classes:

Example 1 - `sklearn.ensemble.HistGradientBoostingClassifier`
Example 2 - `sklearn.ensemble.ExtraTreesClassifier` with a smaller configuration
Example 3 - `sklearn.ensemble.ExtraTreesClassifier` with a larger configuration
Thanks a lot for doing the tests. A first conclusion would be that for small models, calling the hashing function is actually as slow as loading the model; only for big models would we notice a difference. But the difference can still be quite significant, as shown in your last example. For this test, could you please check what the times would be when calling the hashing functions from Python using hashlib? I would expect that it adds a bit of overhead. Also, just in case we require a more secure hashing function, could you please check both md5 and, say, sha256? And maybe run the tests a few times using timeit.
I measured the times calling the hashing functions from Python, as suggested.

Test 1
Test 2
Test 3

It looks like sha256 is faster than md5 from these tests. Let me know if you want me to share the code snippet that I've used.
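For reference, a timing snippet along these lines could look like the following sketch; the `MODEL_PATH` placeholder and the helper names are illustrative and not the actual code used in the thread:

```python
import hashlib
import pickle
import timeit

MODEL_PATH = "model.pkl"  # placeholder path, not the actual test file

def file_hash(path, algorithm, chunk_size=1024 * 1024):
    """Hash the file in chunks so large models are not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

for label, func in [
    ("md5", lambda: file_hash(MODEL_PATH, "md5")),
    ("sha256", lambda: file_hash(MODEL_PATH, "sha256")),
    ("pickle.load", lambda: load_model(MODEL_PATH)),
]:
    # Take the best of a few repeats to reduce noise from disk caching and system load.
    best = min(timeit.repeat(func, repeat=5, number=10)) / 10
    print(f"{label}: {best:.6f} s per call")
```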
So I assume the 3 tested models are the same as in your previous post. If so, do I interpret the data correctly that calling md5 from Python (2nd comment) on the small model is 10x faster (0.0003 sec) than calling it from the command line (1st comment, 0.003 sec)? That's strange. But for the other models, there is no big difference, so I guess we can ignore this outlier. Anyway, overall it seems like sha256 is the better choice and is consistently faster than actually loading the model, especially for larger models. Then I think it's safe to say we should go ahead and implement the caching.
Yes, your assumption is correct. I didn't spot that outlier; I tested again from the command line and it gives me the same result. Okay, then I will open a PR to implement the caching! :)
I just learned about
I think that we wouldn't be able to use
I opened PR #299 that implements the functionality discussed. However, there's a small problem. I worked on a previous PR, #207, which is supposed to be merged, but I worked on that PR in the main branch of my fork (my mistake). Therefore, I had to branch off from main to implement the cached model loading, and now all my previous commits are attached to #299. I'm sorry if that's an issue.
Follow-up to #96 and #205.

The `_load_model` function (`skops/skops/card/_model_card.py`, lines 168 to 205 in 30ddea7) is currently called each time the `model` of a card is accessed, e.g. when `repr` is called. When the model is a path, it has to be loaded each time, which can be expensive. Therefore, we would like to add a cache to `_load_model`.

A simple `functools.{lru_cache,cache}` is, however, not sufficient. This is because the argument, in this case the model path, could remain the same while the model on the drive is overwritten by a new model. The cache key would need to be something else, like the md5 sum of the file that the path points to.
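As a rough illustration of the content-keyed cache described above (this is only a sketch, not the implementation from PR #299; the helper names and the `loader` callable are hypothetical, and sha256 is used following the timing discussion in the thread):

```python
import hashlib
from pathlib import Path

# Hypothetical module-level cache mapping a content hash to the loaded model.
_MODEL_CACHE = {}

def _file_digest(path, algorithm="sha256"):
    """Hash the file contents in chunks to avoid reading huge models at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_model_cached(model, loader):
    """Return the loaded model, reusing the cache while the file content is unchanged.

    `loader` stands in for whatever deserializer is used (e.g. joblib.load).
    If `model` is already a fitted estimator rather than a path, it is
    returned as-is, mirroring how _load_model accepts both kinds of input.
    """
    if not isinstance(model, (str, Path)):
        return model
    key = _file_digest(model)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(model)
    return _MODEL_CACHE[key]
```

Unlike a `functools.lru_cache` keyed on the path, the hash key changes whenever the file content changes, so an overwritten model file is reloaded while an unchanged file is served from the cache.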