Cache model loading in model card #243
Comments
Hi! I would like to help here. If we were to use the md5 sum of the file that the path points to, then we would have to calculate the md5 sum every time the function is called, right?
Each time the function is called with a str or Path argument, yes. I would assume that calculating the md5 should be really fast, but ideally that should be measured before we change the code.
Actually, it's better to use the md5 sum of the model file rather than loading the model file every time anyway. I could start a PR on this issue.
Well, if loading the file takes 100 ms and calculating the md5 takes 99 ms, arguably it's better to just always load the file. That's why we should measure, just to be sure. Instead of md5, we could also consider storing and checking the time the file was last modified.
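As an illustration of that last-modified-time idea, here is a minimal sketch; the cache dictionary, the function name, and the `loader` callable are hypothetical and not taken from the skops codebase:

```python
import os
from pathlib import Path

# Hypothetical module-level cache mapping a file path to (mtime, loaded model).
_MODEL_CACHE = {}

def load_model_if_changed(path, loader):
    """Reload the model only when the file's last-modified time has changed."""
    key = str(Path(path))
    mtime = os.path.getmtime(key)
    cached = _MODEL_CACHE.get(key)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    model = loader(key)  # e.g. joblib.load or pickle.load
    _MODEL_CACHE[key] = (mtime, model)
    return model
```

Note that mtime detects changes by timestamp rather than by content, so it is cheaper than hashing but may occasionally reload unchanged content, which fits the goal discussed later of avoiding most rather than all redundant reloads.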
Okay, I understand. Maybe we could use some metadata of the model file to check whether it has changed or not.
The idea is not to "never reload a file twice"; the idea is to "not reload the same content in most cases". So if a solution avoids reloading the same file in 95% of cases, that's a good enough solution, especially if it keeps things simple.
Thanks for clarifying. I think I could do some testing on the loading times and the md5 hash of a model file, and give some feedback on how they compare?
Sure.
Sorry I haven't posted a response to what I said I would test. I will give a response in the next 2-3 days :)
Hey! I have some results to share with you. I have tested the loading times and the time taken to compute the md5sum of the pickled model file with some examples, using two model classes:

Example 1 - `sklearn.ensemble.HistGradientBoostingClassifier`
Example 2 - `sklearn.ensemble.ExtraTreesClassifier` with a smaller configuration
Example 3 - `sklearn.ensemble.ExtraTreesClassifier` with a larger configuration
Thanks a lot for doing the tests. A first conclusion would be that for small models, calling the hashing function is actually as slow as loading the model; only for big models would we notice a difference. But the difference can still be quite significant, as shown in your last example. For this test, could you please check what the times would be when calling the hashing functions from Python using hashlib? I would expect that it adds a bit of overhead. Also, just in case we require a more secure hashing function, could you please check both md5 and, say, sha256? And maybe run the tests a few times using timeit.
I measured the times calling the hashing functions from Python, as suggested.

Test 1
Test 2
Test 3

It looks like sha256 is faster than md5 from these tests. Let me know if you want me to share the code snippet that I've used.
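For reference, a timing snippet along these lines could look like the following sketch; the `MODEL_PATH` placeholder and the helper names are illustrative and not the actual code used in the thread:

```python
import hashlib
import pickle
import timeit

MODEL_PATH = "model.pkl"  # placeholder path, not the actual test file

def file_hash(path, algorithm, chunk_size=1024 * 1024):
    """Hash the file in chunks so large models are not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

for label, func in [
    ("md5", lambda: file_hash(MODEL_PATH, "md5")),
    ("sha256", lambda: file_hash(MODEL_PATH, "sha256")),
    ("pickle.load", lambda: load_model(MODEL_PATH)),
]:
    # Take the best of a few repeats to reduce noise from disk caching and system load.
    best = min(timeit.repeat(func, repeat=5, number=10)) / 10
    print(f"{label}: {best:.6f} s per call")
```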
So I assume the 3 tested models are the same as in your previous post. If so, do I interpret the data correctly that calling md5 from Python (2nd comment) on the small model is 10x faster (0.0003 sec) than calling it from the command line (1st comment, 0.003 sec)? That's strange. But for the other models, there is no big difference, so I guess we can ignore this outlier. Anyway, overall it seems like sha256 is the better choice and is consistently faster than actually loading the model, especially for larger models. Then I think it's safe to say we should go ahead and implement the caching.
Yes, your assumption is correct. I didn't spot that outlier; I tested again from the command line and it gives me the same result. Okay, then I will open a PR to implement the caching! :)
I just learned about
I think that we wouldn't be able to use
I opened PR #299 that implements the functionality discussed. However, there's a small problem. I worked on a previous PR, #207, which is supposed to be merged, but I worked on that PR in the main branch of my fork (my mistake). Therefore, I had to branch off from main to implement the cached model loading, and now all my previous commits are attached to #299. I'm sorry if that's an issue.
Follow-up to #96 and #205.

The `_load_model` function (`skops/skops/card/_model_card.py`, lines 168 to 205 in 30ddea7) is currently called each time the `model` of a card is accessed, e.g. when `repr` is called. When the model is a path, it has to be loaded each time, which can be expensive. Therefore, we would like to add a cache to `_load_model`.

A simple `functools.{lru_cache,cache}` is, however, not sufficient. This is because the argument, in this case the model path, could remain the same while the model on the drive is overwritten by a new model. The cache key would need to be something else, like the md5 sum of the file that the path points to.
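As a rough illustration of the content-keyed cache described above (this is only a sketch, not the implementation from PR #299; the helper names and the `loader` callable are hypothetical, and sha256 is used following the timing discussion in the thread):

```python
import hashlib
from pathlib import Path

# Hypothetical module-level cache mapping a content hash to the loaded model.
_MODEL_CACHE = {}

def _file_digest(path, algorithm="sha256"):
    """Hash the file contents in chunks to avoid reading huge models at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_model_cached(model, loader):
    """Return the loaded model, reusing the cache while the file content is unchanged.

    `loader` stands in for whatever deserializer is used (e.g. joblib.load).
    If `model` is already a fitted estimator rather than a path, it is
    returned as-is, mirroring how _load_model accepts both kinds of input.
    """
    if not isinstance(model, (str, Path)):
        return model
    key = _file_digest(model)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(model)
    return _MODEL_CACHE[key]
```

Unlike a `functools.lru_cache` keyed on the path, the hash key changes whenever the file content changes, so an overwritten model file is reloaded while an unchanged file is served from the cache.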