[package] add unique_id to StorageImpl and use as storage identifier #59488
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 7565e82 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
…identifier"

Changes made:
- Adds a `unique_id` field to `StorageImpl` and adds access to the field from Python
- Changes `torch.package` and mobile's `tensor_cdata_naming_scheme` to `tensor_unique_id_naming_scheme` to use this unique identifier for storage serialization instead of the cptr to the `StorageImpl`

These changes are needed to allow quantization to create/delete tensors during serialization (which leads to the StorageImpl cptrs being reused).

[ghstack-poisoned]
@Lilyjjo has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
dreiss left a comment:
This looks good to me. I'll give others a chance to comment before accepting. In particular, I like that this approach makes the storage files have shorter names that are deterministic (assuming you have a deterministic program that you restart every time you want to export your model).
If others feel strongly that we'd be better off continuing to use the storage addresses and just keeping references during serialization to prevent deallocation, I'm fine with that as well.
```cpp
// local to process cuda memory allocation
bool received_cuda_;
Allocator* allocator_;
uint64_t unique_id;
```
Should have a trailing underscore to match the other members.
```cpp
namespace c10 {

static std::atomic<std::uint64_t> next_unique_storage_id{0};
```
Maybe drop the std:: for uint64_t. I don't think that's common in our codebase.
Note to self: Profilo uses a std::atomic<int64_t>, so I assume this is safe even on 32-bit apps that are built with very old toolchains.
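For concreteness, here is a minimal standalone sketch of how the counter in the diff above could hand out ids, assuming the id is grabbed in the StorageImpl constructor. Only the member, counter, and getter names come from the diff; the wiring shown here is a guess, not the PR's actual code.

```cpp
#include <atomic>
#include <cstdint>

namespace sketch {

// Process-wide counter, mirroring next_unique_storage_id from the diff.
static std::atomic<std::uint64_t> next_unique_storage_id{0};

class StorageImplLike {
 public:
  StorageImplLike()
      // fetch_add is atomic, so concurrently constructed storages still get
      // distinct ids; relaxed ordering is enough because nothing synchronizes
      // through this counter.
      : unique_id_(next_unique_storage_id.fetch_add(
            1, std::memory_order_relaxed)) {}

  std::uint64_t get_unique_id() const {
    return unique_id_;
  }

 private:
  std::uint64_t unique_id_; // trailing underscore, per the review note above
};

} // namespace sketch
```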
Would just holding onto references to the tensors being pickled, for the duration of serialization, be sufficient and perhaps cleaner?
@kimishpatel Yeah, that was the other implementation idea that was discussed. It's unclear which solution is better. The
Did not know about the torch.package interaction. Makes sense now.
```cpp
  return received_cuda_;
}

uint64_t get_unique_id() {
```
Maybe add a comment about how this is used.
I think I understand how this works. cc @swolchok in case this may be a perf issue, since there's now an `std::atomic` counter involved.
There are probably some edge cases that may not be handled by this code, and I am wondering if these can be documented somewhere, since I find that extremely useful (and I assume others will too). My understanding is that the main motivation behind this change is to prevent the serializer from getting confused when it sees the same memory address that actually belongs to 2 different tensors (due to memory address re-use). Should we document this somewhere (maybe where @raziel mentioned?)?
I can't seem to figure out how to comment on code that isn't being changed in GitHub, so I'm pasting a Diffusion link here. Q. Should the ID be copied over in this case too? Q. Also, should
I think this is a general case that exists outside of this diff also?
@Lilyjjo although I wonder if tensor creation introduces perf issues.
Yes! My mental model is that this change solves some problems (not all). The question is whether we should more rigorously document the handled and unhandled cases, so that the next time we need to iterate on this, it's super clear what needs to be changed.
```cpp
namespace c10 {

static std::atomic<std::uint64_t> next_unique_storage_id{0};
```
I am wary of using a global to track this information and of adding an additional 8 bytes to StorageImpl. It also seems to be a local answer to a more fundamental question, which is: how do we handle identity for torch objects?
So far, we have mostly followed Python's convention and used the pointer. This is also how we do id()/hash() for Python tensors and their storages and thus for their serialization.
This obviously has serious issues if we are re-using memory, but has the advantage that it's consistent across all objects and matches what Python does. @ezyang any thoughts here?
On second thought, I think that the
My understanding is that from_blob will give you a non-owning Tensor, since the void* the Tensor is created from is not owned by the Tensor. In that case you cannot make any assumptions. And if such a tensor (very unlikely) gets serialized, then yes, you can create two tensors via from_blob using the same void* that get serialized separately. And I think that is ok.
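A small illustration of that scenario (not code from this PR; written against the public libtorch API): two non-owning tensors built over the same buffer end up with distinct StorageImpls, so any storage-identity scheme treats them as unrelated even though they alias the same memory.

```cpp
#include <iostream>
#include <vector>

#include <torch/torch.h>

int main() {
  std::vector<float> buffer(16, 1.0f);

  // Two non-owning tensors over the same raw buffer.
  auto a = torch::from_blob(buffer.data(), {16}, torch::kFloat);
  auto b = torch::from_blob(buffer.data(), {16}, torch::kFloat);

  std::cout << (a.data_ptr() == b.data_ptr()) << "\n";  // 1: same underlying memory
  std::cout << (a.storage().unsafeGetStorageImpl() ==
                b.storage().unsafeGetStorageImpl()) << "\n";  // 0: different storages
  return 0;
}
```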
I can get a reference-tracking version of this diff out as well; after discussion with @suo, it seems more reasonable than I originally thought for torch.package. @dreiss, we can also discuss ways to shorten the names of the tensors in the serialization file (possibly by removing the .storage tag as a start).
I was thinking about this over the weekend as well. What about, instead of holding onto a vector or set of seen StorageImpl objects, we hold onto a map from StorageImpl to integer where we just increment the number for each added StorageImpl (basically,
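A rough sketch of that idea (names here are illustrative, not from the PR): hold a reference to every storage seen during serialization so its address can't be reused, and key the archive entry on a small dense integer instead of the pointer value.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

#include <c10/core/Storage.h>

class StorageIdAssigner {
 public:
  // Returns a stable id for the given storage, assigning the next integer the
  // first time it is seen. Keeping a copy of the Storage holds a reference to
  // the underlying StorageImpl, so its address cannot be reused mid-serialization.
  uint64_t id_for(const c10::Storage& storage) {
    const c10::StorageImpl* key = storage.unsafeGetStorageImpl();
    auto it = ids_.find(key);
    if (it != ids_.end()) {
      return it->second;
    }
    uint64_t id = ids_.size();  // ids are just 0, 1, 2, ... in insertion order
    ids_.emplace(key, id);
    keep_alive_.push_back(storage);
    return id;
  }

 private:
  std::unordered_map<const c10::StorageImpl*, uint64_t> ids_;
  std::vector<c10::Storage> keep_alive_;
};
```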
Chiming in to say that I am still pro the solution that retains StorageImpls so you don't have the address-reuse problem.
Yeah, I agree. We already have a
This can be done as a separate diff, since the reader doesn't care what the file names are, so if this feature is holding up progress in any way, let's punt on it and move forward with the fix.
Alternative reference-tracking PR: #59642
I can look into the naming fixes in a follow-up! This PR solves storage identity problems for TorchScript/Mobile but not for torch.package.
Isn't that a bit dangerous? The signature of the problem was that the lite interpreter module was occasionally unloadable, but I suspect the worst case could be that you get tensors that look similar but are not the same, which leads to incorrect results. And that becomes very hard to debug. I don't know what the minimal thing that can be done here is, perhaps throwing an error if this is about to happen?
Closing this as we took the storage-reference-retaining route, implemented for mobile here: #59642, and a fix for the same problem in torch.package here: #59735. Thanks y'all for the input and for helping solve this problem!
Stack from ghstack:
Changes made:
- Adds a `unique_id` field to `StorageImpl` and adds access to the field from Python
- Changes `torch.package` and mobile's `tensor_cdata_naming_scheme` to `tensor_unique_id_naming_scheme` to use this unique identifier for storage serialization instead of the cptr to the `StorageImpl`

These changes are needed to allow quantization to create/delete tensors during serialization (which leads to the StorageImpl cptrs being reused). The only functional change for the torch.package/mobile code is that storages now have a different naming scheme.
Differential Revision: D28911924
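A hedged sketch of what the naming-scheme swap amounts to (not the actual serializer code from this PR; `get_unique_id()` is the accessor added in the diff above, everything else here is illustrative):

```cpp
#include <cstdint>
#include <string>

#include <c10/core/Storage.h>

// Old scheme: name the archive entry after the StorageImpl's address (the
// "cdata" pointer). Short-lived storages can hand the same address to a later
// storage, so two different storages may collide on this name.
std::string storage_name_from_cdata(const c10::Storage& storage) {
  return std::to_string(
      reinterpret_cast<std::uintptr_t>(storage.unsafeGetStorageImpl()));
}

// New scheme: name the entry after the per-process counter added in this PR.
// The counter never repeats within a process, so the name stays stable even
// if the allocator reuses the address.
std::string storage_name_from_unique_id(const c10::Storage& storage) {
  return std::to_string(storage.unsafeGetStorageImpl()->get_unique_id());
}
```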