-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Refactor with a SaveState and avoid duplicate numpy arrays #173
MNT Refactor with a SaveState and avoid duplicate numpy arrays #173
Conversation
Previously, we were passing only the directory path, now we pass a dataclass that contains the path but also the protocol and a way to memoize objects. This way, those functions that, for some reason, need to memoize the object, can now do so. This allows us, for instance, to make sure that the same numpy array is not saved multiple times on disk (see discussion in skops-dev#150). I went ahead and implemented memoization for numpy arrays (and scipy sparse matrices) in order to ensure that this refactor actually is sufficient to solve the initial problem. Design The design goal here was to make as few code changes as possible while still enabling much greater flexibility. E.g., I wanted to leave the individual get_state/get_instance methods as they are (except for renaming one of the parameters), since I find the current structure very readable. That's why I didn't go with a giant class with a high number of get_state/get_instance methods, which is another solution we discussed. The SaveState object can serve a similar purpose as self would have served if we choose a class based approach, i.e. it can hold the state necessary for the functions to perform their work and in the future, we can add more state if necessary. For the time being, I didn't change the get_instance methods at all, even though we could make an analogous change there, passing a LoadState instead of src everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary. The memoization part has been modeled to be similar to what pickle does but tailored to our needs. Coincidental changes While working on skops-dev#150, I discovered a minor bug where trying to store an object numpy array resulted in the creation of a broken .npy file being left over. This is because numpy tries to write to the file until it encounters an error and raises, but then doesn't clean up said file. Before this bugfix, we would include that broken file in the zip archive, although we wouldn't do anything with it. Now, no such file is being created. Since skops-dev#150 won't be merged, I added the bugfix and a corresponding test here. Moreover, while working on this, since I had to touch the signature of the get_state functions, I also added type hints in accordance to the rest of the code base ("light" types). While working on that, I found that we have a single case where the get_state function would not return a dict, namely when the object is a primitive type that can be json-serialized, in which case we return a string. I changed the return type there to always be dict for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the types, LGTM. The typehints really makes this less readable and I don't like the repeated types all over the place. Types are evil :D
# we use numpy's internal save mechanism to store the dtype by | ||
# saving/loading an empty array with that dtype. | ||
tmp = np.ndarray(0, dtype=obj) | ||
tmp: np.typing.NDArray = np.ndarray(0, dtype=obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of typing local variables. if mypy is failing on such a line, we should just disable those checks if possible.
I added them here for consistency and because it makes working with the new Since you merged: Do you want them removed or did you accept your fate? ^^ |
I have accepted my fate for now to see how I feel about it while coding it lol. The same needs to happen for get_instance methods. |
Do you mean a refactor of
|
Yes, I thought I can leave that to the audit PR, since that's where we need it. |
Okay, LMK if you want me to take a look. |
Supersedes #150
Description
Previously, we were passing only the directory path, now we pass a
dataclass
that contains the path but also the protocol and a way to memoize objects.This way, those functions that, for some reason, need to memoize the object, can now do so. This allows us, for instance, to make sure that the same numpy array is not saved multiple times on disk (see discussion in #150).
I went ahead and implemented memoization for numpy arrays (and scipy sparse matrices) in order to ensure that this refactor actually is sufficient to solve the initial problem.
Design
The design goal here was to make as few code changes as possible while still enabling much greater flexibility. E.g., I wanted to leave the individual
get_state
/get_instance
methods as they are (except for renaming one of the parameters), since I find the current structure very readable. That's why I didn't go with a giant class with a high number ofget_state
/get_instance
methods, which is another solution we discussed.The
SaveState
object can serve a similar purpose as self would have served if we choose a class based approach, i.e. it can hold the state necessary for the functions to perform their work and in the future, we can add more state if necessary.For the time being, I didn't change the
get_instance
methods at all, even though we could make an analogous change there, passing aLoadState
instead ofsrc
everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary.The memoization part has been modeled to be similar to what pickle does but tailored to our needs.
Coincidental changes
While working on #150, I discovered a minor bug where trying to store an object numpy array resulted in the creation of a broken
.npy
file being left over. This is because numpy tries to write to the file until it encounters an error and raises, but then doesn't clean up said file. Before this bugfix, we would include that broken file in the zip archive, although we wouldn't do anything with it. Now, no such file is being created. Since #150 won't be merged, I added the bugfix and a corresponding test here.Moreover, while working on this, since I had to touch the signature of the
get_state
functions, I also added type hints in accordance to the rest of the code base ("light" types).While working on that, I found that we have a single case where the
get_state
function would not return a dict, namely when the object is a primitive type that can be json-serialized, in which case we return a string. I changed the return type there to always be dict for consistency (so in this case a dict that contains the json-serialized string).