
Conversation

lobstergrindset commented Jun 4, 2021

Stack from ghstack:

Changes made:

  • Adds a unique_id field to StorageImpl and adds access to the field from Python
  • Changes torch.package's and mobile's tensor_cdata_naming_scheme to tensor_unique_id_naming_scheme so that storage serialization uses this unique identifier instead of the cptr to the StorageImpl

These changes are needed to allow quantization to create/delete tensors during serialization (which leads to the StorageImpl cptrs being reused). The only functional change for the torch.package/mobile code is that storages now have a different naming scheme.

Differential Revision: D28911924
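
To make the motivation concrete, here is a tiny standalone C++ illustration (not PyTorch code) of the address-reuse hazard described above; everything in it is a generic example, not part of the PR diff:

```cpp
// Once an object is freed, the allocator may hand the same address to a new
// object, so naming storages by pointer can conflate two distinct storages.
#include <iostream>
#include <memory>

int main() {
  auto first = std::make_unique<int>(1);
  const void* first_addr = first.get();
  first.reset();  // freed mid-serialization, e.g. when quantization rewrites tensors

  auto second = std::make_unique<int>(2);
  if (second.get() == first_addr) {  // allowed to happen, and often does
    std::cout << "address reused: two distinct objects would share one name\n";
  }
  return 0;
}
```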

facebook-github-bot (Contributor) commented Jun 4, 2021

💊 CI failures summary and remediations

As of commit 7565e82 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 05 04:46:27 ----------------------------------------------------------------------
Jun 05 04:46:27 Traceback (most recent call last):
Jun 05 04:46:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 398, in wrapper
Jun 05 04:46:27     self._join_processes(fn)
Jun 05 04:46:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 590, in _join_processes
Jun 05 04:46:27     self._check_return_codes(elapsed_time)
Jun 05 04:46:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 643, in _check_return_codes
Jun 05 04:46:27     i, first_process.exitcode, p.exitcode
Jun 05 04:46:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1403, in assertEqual
Jun 05 04:46:27     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jun 05 04:46:27 AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1 and -11 gives a difference of 12, but the allowed difference with rtol=0 and atol=0 is only 0!
Jun 05 04:46:27 Expect process 1 exit code to match Process 0 exit code of -11, but got 1
Jun 05 04:46:27 
Jun 05 04:46:27 ----------------------------------------------------------------------
Jun 05 04:46:27 Ran 213 tests in 396.897s
Jun 05 04:46:27 
Jun 05 04:46:27 FAILED (failures=1, skipped=118)
Jun 05 04:46:27 
Jun 05 04:46:27 Generating XML reports...
Jun 05 04:46:27 Generated XML report: test-reports/dist-nccl/distributed.test_distributed_spawn/TEST-TestDistBackendWithSpawn-20210605043950.xml
Jun 05 04:46:27 Traceback (most recent call last):

This comment was automatically generated by Dr. CI.

facebook-github-bot added the oncall: jit label Jun 4, 2021
lobstergrindset pushed a commit that referenced this pull request Jun 4, 2021

lobstergrindset (Author) commented:
@Lilyjjo has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dreiss (Contributor) left a comment:

This looks good to me. I'll give others a chance to comment before accepting. In particular, I like that this approach makes the storage files have shorter names that are deterministic (assuming you have a deterministic program that you restart every time you want to export your model).

If others feel strongly that we'd be better off by continuing to use the storage addresses and just keep references during serialization to prevent deallocation, I'm fine with that as well.

// local to process cuda memory allocation
bool received_cuda_;
Allocator* allocator_;
uint64_t unique_id;

Contributor:
Should have a trailing underscore to match the other members.


namespace c10 {

static std::atomic<std::uint64_t> next_unique_storage_id{0};

Contributor:
Maybe drop the std:: for uint64_t. I don't think that's common in our codebase.

Note to self: Profilo uses a std::atomic<int64_t>, so I assume this is safe even on 32-bit apps that are built with very old toolchains.
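
For context, here is a minimal self-contained sketch of the counter-based id scheme under discussion, with illustrative member names rather than the exact diff:

```cpp
// Illustrative sketch only: an atomic counter assigning a per-StorageImpl id.
#include <atomic>
#include <cstdint>

namespace c10 {

static std::atomic<uint64_t> next_unique_storage_id{0};

struct StorageImplSketch {
  // Each new storage grabs the next id; relaxed ordering is enough because
  // ids only need to be distinct, not ordered across threads.
  uint64_t unique_id_ =
      next_unique_storage_id.fetch_add(1, std::memory_order_relaxed);

  // Used by serialization to name storages independently of their address.
  uint64_t get_unique_id() const {
    return unique_id_;
  }
};

} // namespace c10
```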

kimishpatel (Contributor) commented:
Should just holding onto references to the tensors being pickled, for the duration of serialization, be sufficient and perhaps cleaner?

lobstergrindset (Author) commented:
> Should just holding onto references to the tensors being pickled, for the duration of serialization, be sufficient and perhaps cleaner?

@kimishpatel Yeah, that was the other implementation idea that was discussed. It's unclear which solution is better. The torch.package pathway could get a little ugly with holding onto the StorageImpl references, because users can place as many models into the same archive as they want, so we'd have to write some small infra to track the StorageImpl pointers on both the Python and C++ sides across the lifetime of the PackageExporter. It would be doable, but it would add more complexity to the torch.package code, which is undesirable from the torch.package perspective.

kimishpatel (Contributor) commented:
> > Should just holding onto references to the tensors being pickled, for the duration of serialization, be sufficient and perhaps cleaner?
>
> @kimishpatel Yeah, that was the other implementation idea that was discussed. It's unclear which solution is better. The torch.package pathway could get a little ugly with holding onto the StorageImpl references, because users can place as many models into the same archive as they want, so we'd have to write some small infra to track the StorageImpl pointers on both the Python and C++ sides across the lifetime of the PackageExporter. It would be doable, but it would add more complexity to the torch.package code, which is undesirable from the torch.package perspective.

Did not know about torch.package interaction. Makes sense now.

return received_cuda_;
}

uint64_t get_unique_id() {

Contributor:
Maybe add a comment about how this is used.

dhruvbird (Contributor) commented:
I think I understand how this works. cc @swolchok in case this may be a perf issue, since there's now a std::atomic being incremented for every new memory allocation (I think it should be okay, since the allocation itself is probably the dominant cost anyway). @Lilyjjo is there a benchmark that shows the impact of this change (curious)?

dhruvbird (Contributor) commented:
There are probably some edge cases that may not be handled by this code, and I'm wondering if they can be documented somewhere, since I find that extremely useful (and I assume others will too). My understanding is that the main motivation behind this change is to prevent the serializer from getting confused when it sees the same memory address for what are actually two different tensors (due to memory address reuse). Should we document this somewhere (maybe where @raziel mentioned)?

  1. The from_blob() case doesn't seem to be handled, i.e. if you create two tensors from the same (unchanged) memory using from_blob(), they will get two different IDs (I think) and hence be serialized as separate blobs (could you please confirm or deny this)?
  2. Something else??

dhruvbird (Contributor) commented:
I can't seem to figure out how to comment on code that isn't being changed on GitHub, so I'm pasting a Diffusion link here.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/c10/core/StorageImpl.h?commit=05fa0747d70fbb47c1e4cd89b5c7947f80ee5ff5&lines=175-185

  /**
   * Can only be called when use_count is 1
   */
  void UniqueStorageShareExternalPointer(
      at::DataPtr&& data_ptr,
      size_t size_bytes) {
    data_ptr_ = std::move(data_ptr);
    size_bytes_ = size_bytes;
    allocator_ = nullptr;
    resizable_ = false;
  }

Q. Should the ID be copied over in this case too?

Q. Also, should set_data_ptr() also update the unique ID?

kimishpatel (Contributor) commented:
> There are probably some edge cases that may not be handled by this code, and I'm wondering if they can be documented somewhere, since I find that extremely useful (and I assume others will too). My understanding is that the main motivation behind this change is to prevent the serializer from getting confused when it sees the same memory address for what are actually two different tensors (due to memory address reuse). Should we document this somewhere (maybe where @raziel mentioned)?
>
> 1. The from_blob() case doesn't seem to be handled, i.e. if you create two tensors from the same (unchanged) memory using from_blob(), they will get two different IDs (I think) and hence be serialized as separate blobs (could you please confirm or deny this)?

I think this is a general case that exists outside of this diff also?

kimishpatel (Contributor) commented:
> > > Should just holding onto references to the tensors being pickled, for the duration of serialization, be sufficient and perhaps cleaner?
> >
> > @kimishpatel Yeah, that was the other implementation idea that was discussed. It's unclear which solution is better. The torch.package pathway could get a little ugly with holding onto the StorageImpl references, because users can place as many models into the same archive as they want, so we'd have to write some small infra to track the StorageImpl pointers on both the Python and C++ sides across the lifetime of the PackageExporter. It would be doable, but it would add more complexity to the torch.package code, which is undesirable from the torch.package perspective.
>
> Did not know about torch.package interaction. Makes sense now.

@Lilyjjo although I wonder if tensor creation introduces perf issues.

dhruvbird (Contributor) commented:
> I think this is a general case that exists outside of this diff also?

Yes! My mental model is that this change solves some problems (not all); the question is whether we should more rigorously document the handled and unhandled cases so that the next time we need to iterate on this, it's super clear what needs to be changed.


namespace c10 {

static std::atomic<std::uint64_t> next_unique_storage_id{0};

Member:
I am wary of using a global to track this information and of adding an additional 8 bytes to StorageImpl. It also seems to be a local answer to a more fundamental question, which is: how do we handle identity for torch objects?

So far, we have mostly followed Python's convention and used the pointer. This is also how we do id()/hash() for Python tensors and their storages and thus for their serialization.

This obviously has serious issues if we are re-using memory, but has the advantage that it's consistent across all objects and matches what Python does. @ezyang any thoughts here?

dhruvbird (Contributor) commented:
> I think this is a general case that exists outside of this diff also?

On second thought, I think the from_blob() case may be handled differently when using the memory (pointer) address, since in that case one would consider the tensor storage to be the same. It of course has other issues that we are trying to solve with this diff/PR.

kimishpatel (Contributor) commented:
> > I think this is a general case that exists outside of this diff also?
>
> On second thought, I think the from_blob() case may be handled differently when using the memory (pointer) address, since in that case one would consider the tensor storage to be the same. It of course has other issues that we are trying to solve with this diff/PR.

My understanding is that from_blob will give you a non-owning Tensor, since the void* the Tensor is created from is not owned by the Tensor. In that case you cannot make any assumptions. And if such a tensor (very unlikely) gets serialized, then yes, you can create two tensors via from_blob using the same void* that get serialized separately. And I think that is OK.
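
For readers skimming the thread, here is a hedged illustration of that from_blob() scenario; the assertions sketch the expected distinction and are assumptions, not tests taken from this PR:

```cpp
// Two non-owning tensors built over the same unchanged buffer end up with
// distinct StorageImpls, so an id-per-StorageImpl scheme would serialize them
// as two blobs, while an address-based scheme would see one storage.
#include <torch/torch.h>

int main() {
  float buffer[4] = {1.f, 2.f, 3.f, 4.f};

  // Neither tensor owns `buffer`; each call creates its own StorageImpl
  // wrapping the same data pointer.
  auto a = torch::from_blob(buffer, {4});
  auto b = torch::from_blob(buffer, {4});

  TORCH_CHECK(a.data_ptr() == b.data_ptr());  // same underlying memory
  TORCH_CHECK(a.storage().unsafeGetStorageImpl() !=
              b.storage().unsafeGetStorageImpl());  // distinct storage objects
  return 0;
}
```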

lobstergrindset (Author) commented:
I can get the reference-tracking version of this diff out as well; after discussion with @suo, it seems more reasonable for torch.package than I originally thought. @dreiss we can also discuss ways to shorten the names of the tensors in the serialization file (possibly by removing the .storage tag as a start).

dreiss (Contributor) commented Jun 8, 2021

> discuss ways to shorten the names of the tensors in the serialization file

I was thinking about this over the weekend as well. What if, instead of holding on to a vector or set of seen StorageImpl objects, we hold onto a map from StorageImpl to integer, where we just increment the number for each added StorageImpl (basically, if storage not in m: m[storage] = len(m))? That way, we'd get short and deterministic names; even more deterministic than the current diff, you could say.
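
A minimal sketch of that map-based naming idea (class and method names here are illustrative, not the actual torch.package/JIT serializer API):

```cpp
// Deterministic short names: the first storage seen gets "0", the next "1",
// and so on, independent of the storage's memory address.
#include <cstddef>
#include <string>
#include <unordered_map>

class StorageNamer {
 public:
  std::string nameFor(const void* storage_impl) {
    auto it = seen_.find(storage_impl);
    if (it == seen_.end()) {
      // size() is evaluated before insertion, so ids increase 0, 1, 2, ...
      it = seen_.emplace(storage_impl, seen_.size()).first;
    }
    return std::to_string(it->second);
  }

 private:
  std::unordered_map<const void*, std::size_t> seen_;
};
```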

ezyang (Contributor) commented Jun 8, 2021

Chiming in to say that I am still pro the solution that retains StorageImpls so you don't have the address-reuse problem.

suo (Member) commented Jun 8, 2021

Yeah, I agree. We already have a StorageContext for deserialization; we should have one for serialization. It can hold refs to the storages and map them to shorter names as @dreiss suggests. That way, names can be handed out local to a given serialization context and we can ensure they're all good, without having to deal with difficult identity issues globally.
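
A rough sketch of what such a serialization-side context could look like, assuming hypothetical names (FakeStorage stands in for c10::StorageImpl; this is not the actual PyTorch StorageContext API):

```cpp
// Holds owning references for the duration of one write, so addresses cannot
// be reused while the context is alive, and hands out short local names.
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct FakeStorage {};  // stand-in for c10::StorageImpl

class SerializationStorageContext {
 public:
  std::string getOrAddStorage(std::shared_ptr<FakeStorage> storage) {
    auto it = names_.find(storage.get());
    if (it == names_.end()) {
      keep_alive_.push_back(storage);  // pin the storage until the write ends
      it = names_.emplace(storage.get(), std::to_string(names_.size())).first;
    }
    return it->second;
  }

 private:
  std::vector<std::shared_ptr<FakeStorage>> keep_alive_;
  std::unordered_map<const FakeStorage*, std::string> names_;
};
```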

dreiss (Contributor) commented Jun 8, 2021

> map them to shorter names as @dreiss suggests

This can be done as a separate diff, since the reader doesn't care what the file names are, so if this feature is holding up progress in any way, let's punt on it and move forward with the fix.

lobstergrindset (Author) commented:
Alternative reference-tracking PR: #59642

> map them to shorter names as @dreiss suggests

I can look into the naming fixes in a follow-up! This PR solves storage identity problems for TorchScript/Mobile but not for torch.package. This is because torch.package won't retain references to storages that are serialized just from the Python view (fixing this will require some tweaking to other parts of the code base, but no users are known to be running into this particular problem yet, so I think it's fine for now).

kimishpatel (Contributor) commented:
won't retain references to storages that are serialized just from the python view (fixing this will requires some tweaking to other parts of the code base, but no users are known to be running into issues with this particular problem yet, so I think it's fine for now).

Isn't that a bit dangerous? The signature of the problem was that the lite interpreter module was occasionally unloadable, but I suspect the worst case could be that you get similar but not identical tensors, which leads to incorrect results. And that becomes very hard to debug. Is there something minimal that can be done here, such as throwing an error if this is about to happen?

lobstergrindset (Author) commented:
> Isn't that a bit dangerous? The signature of the problem was that the lite interpreter module was occasionally unloadable, but I suspect the worst case could be that you get similar but not identical tensors, which leads to incorrect results. And that becomes very hard to debug. Is there something minimal that can be done here, such as throwing an error if this is about to happen?

Closing this, as we took the storage-reference-retaining route, implemented for mobile in #59642, with a fix for the same problem in torch.package in #59735.

Thanks y'all for the input and for helping solve this problem!

facebook-github-bot deleted the gh/Lilyjjo/83/head branch July 31, 2021 14:17