
Conversation


@lobstergrindset lobstergrindset commented Jun 8, 2021

Stack from ghstack:

Uses StorageContext to hold a reference to every storage seen during TorchScript serialization, so that tensors can be created and destroyed while serialization is in progress. Tracking the storages solves the ABA memory problem (a freed StorageImpl's address being reused for a different storage, which would give two distinct tensors the same serialized name).
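As a rough sketch of the idea (hypothetical class and member names; the real change lives in the TorchScript serializer, see the diff hunks below), the context simply keeps a strong reference to every storage it has recorded, so a StorageImpl cannot be freed and its address reused for a different tensor while serialization is still in flight:

```
#include <string>
#include <unordered_map>
#include <c10/core/Storage.h>

// Hypothetical sketch of the StorageContext idea: the context owns a
// reference to every storage recorded during serialization, so the
// underlying StorageImpl cannot be destroyed (and its address reused for
// a new storage) until serialization is finished.
class StorageContextSketch {
 public:
  // Record a storage under its string id; copying the c10::Storage bumps
  // the refcount of the StorageImpl it wraps.
  void addStorage(const std::string& id, c10::Storage storage) {
    storages_.emplace(id, std::move(storage));
  }

  bool hasStorage(const std::string& id) const {
    return storages_.count(id) != 0;
  }

 private:
  std::unordered_map<std::string, c10::Storage> storages_;
};
```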

Differential Revision: D28968947

@facebook-github-bot facebook-github-bot added the cla signed and oncall: jit (Add this issue/PR to JIT oncall triage queue) labels Jun 8, 2021
Contributor

facebook-github-bot commented Jun 8, 2021

💊 CI failures summary and remediations

As of commit 7f61ba2 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 08 23:53:59 test_udf_remote_message_delay...yUniqueId(created_on=0, local_id=0) to be created.
Jun 08 23:53:15 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7febcd6491d3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:15 frame #14: <unknown function> + 0xc8421 (0x7febc20fd421 in /opt/conda/lib/libstdc++.so.6)
Jun 08 23:53:15 frame #15: <unknown function> + 0x76ba (0x7febdcad56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 08 23:53:15 frame #16: clone + 0x6d (0x7febdc80b51d in /lib/x86_64-linux-gnu/libc.so.6)
Jun 08 23:53:15 
Jun 08 23:53:15 ok (4.457s)
Jun 08 23:53:31   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.993s)
Jun 08 23:53:41   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (10.082s)
Jun 08 23:53:46   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.456s)
Jun 08 23:53:54   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.568s)
Jun 08 23:53:59   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 08 23:53:59 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 08 23:53:59 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f2a64241499 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f2a6423d5d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f2a6423ef2e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f2a5cde2504 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7f2a5cdd24a1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7f2a6559fcf8 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 08 23:53:59 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7f2a5cdd6fd4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7f2a6559f2f5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 08 23:53:59 frame #8: <unknown function> + 0x405cb6a (0x7f2a5cdd3b6a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

1 job timed out:

  • pytorch_linux_xenial_py3_6_gcc5_4_test


lobstergrindset pushed a commit that referenced this pull request Jun 8, 2021
Author

@Lilyjjo has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@lobstergrindset lobstergrindset requested review from raziel and suo June 8, 2021 18:29
Contributor

cccclai commented Jun 8, 2021

Can this PR be cherry-picked? Otherwise, the heavy save/load process could be flaky.

".storage");
tensor.storage().unsafeGetStorageImpl()));
tensor_names.push_back(string_id + ".storage");
storage_context_.addStorage(string_id, tensor.storage());
Contributor

I feel it's a great idea to use StorageContext. The storage_context_ lives for the life span of the serializer, so it should be safe for any tensor serialization. @cccclai could you test whether it works for bytecode v5 serialization with quantized models? It looks different from #59488, which you've already tested.
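To make the lifetime argument concrete, here is a minimal sketch (hypothetical names, reusing the StorageContextSketch above; the real code is in the diff hunk under review): the context is a member of the serializer, so every storage it records stays referenced until the entire serialization pass has finished.

```
#include <cstdint>
#include <string>
#include <ATen/ATen.h>

// Hypothetical sketch of how the serializer could name and retain a
// tensor's storage: the name is derived from the StorageImpl address,
// and the storage is registered in the context, whose lifetime matches
// the serializer's.
class SerializerSketch {
 public:
  std::string recordStorage(const at::Tensor& tensor) {
    const std::string string_id = std::to_string(reinterpret_cast<std::intptr_t>(
        tensor.storage().unsafeGetStorageImpl()));
    // Keep the storage alive for the rest of serialization.
    storage_context_.addStorage(string_id, tensor.storage());
    return string_id + ".storage";
  }

 private:
  StorageContextSketch storage_context_;  // destroyed only with the serializer
};
```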

Contributor

@cccclai cccclai Jun 8, 2021

Tested it by re-introducing #58629 and using the following code:

```
import torch
from torch.jit.mobile import _load_for_lite_interpreter

# model_path / model_resave_path: paths to an existing .ptl model and its re-save target
for i in range(20):
    print(i)
    m = torch.jit.load(model_path)
    m._save_for_lite_interpreter(model_resave_path)
    mm = _load_for_lite_interpreter(model_resave_path)
```

```
std::to_string(reinterpret_cast<std::intptr_t>(
tensor.storage().unsafeGetStorageImpl())) +
".storage");
tensor.storage().unsafeGetStorageImpl()));
```
Contributor

I guess the only caveat of using StorageImpl is that the id may change from run to run. It should not be a real issue, but for metadata comparison on data.pkl or bytecode.pkl, the tensor metadata may differ textually even though the tensor content does not change.

Author

Yes, the loss of determinism in the naming is not ideal; others have asked for this to be fixed as well. I can address it in a follow-up PR.
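For illustration only, one way a follow-up could make the names deterministic (a hypothetical sketch, not what this PR does): let the context hand out sequential ids in first-seen order instead of deriving the name from the StorageImpl address.

```
#include <cstdint>
#include <unordered_map>
#include <vector>
#include <c10/core/Storage.h>

// Hypothetical sketch: deterministic storage naming. Ids are assigned in
// first-seen order, so two runs over the same model produce the same names.
class DeterministicStorageContextSketch {
 public:
  // Returns the id for this storage, assigning the next sequential id the
  // first time the storage is seen. Holding the c10::Storage keeps the
  // StorageImpl alive, so the pointer key stays valid and unique.
  uint64_t getOrAddStorage(const c10::Storage& storage) {
    const void* key = storage.unsafeGetStorageImpl();
    auto it = ids_.find(key);
    if (it != ids_.end()) {
      return it->second;
    }
    uint64_t id = storages_.size();
    storages_.push_back(storage);
    ids_.emplace(key, id);
    return id;
  }

 private:
  std::vector<c10::Storage> storages_;             // keeps storages alive
  std::unordered_map<const void*, uint64_t> ids_;  // StorageImpl* -> id
};
```

Because the ids depend only on traversal order, re-serializing the same model would yield identical names in data.pkl and bytecode.pkl.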

Contributor

@cccclai cccclai left a comment

LGTM from mobile side.

Author

lobstergrindset commented Jun 8, 2021

Do we want to try to get this PR into the 1.9 release? I saw that people were still adding to the branch earlier today (#58518).

It would fix the serialization issues that I think are currently present in the release.

Contributor

cccclai commented Jun 8, 2021

> Do we want to try to get this PR into the 1.9 release? I saw that people were still adding to the branch earlier today (#58518). It would fix the serialization issues that I think are currently present in the release.

Let's try. Can you comment under the tracker? I will comment there as well.

Contributor

@iseeyuan iseeyuan left a comment

LGTM. Thanks!

Contributor

@Lilyjjo merged this pull request in 3271853.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
…#59642)

Summary: Pull Request resolved: pytorch#59642

Test Plan: Imported from OSS

Reviewed By: jbschlosser, cccclai

Differential Revision: D28968947

Pulled By: Lilyjjo

fbshipit-source-id: 0046da8adb3a29fb108965a1d2201749fe2d0b41
cccclai added a commit that referenced this pull request Jun 9, 2021
Reintroduce sharing constant between bytecode and torchscript (same as #58629) after the fix #59642 

Test it by:
```
import torch
from torch.jit.mobile import _load_for_lite_interpreter

model_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GKzegAqmfRoYTygDACAlZhsBGFcRbmQwAAAA.ptl"
# model_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GICWmAARE1pJ3yQEAFk58lqFfzVGbgdIAAAi.ptl"
model_resave_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GKzegAqmfRoYTygDACAlZhsBGFcRbmQwAAAA_resave.ptl"

for i in range(20):
    print(i)
    m = torch.jit.load(model_path)
    m._save_for_lite_interpreter(model_resave_path)
    mm = _load_for_lite_interpreter(model_resave_path)
```


Differential Revision: [D29002345](https://our.internmc.facebook.com/intern/diff/D29002345)

[ghstack-poisoned]
cccclai added five more commits referencing this pull request on Jun 10, 2021, each with the same message as the commit above.
facebook-github-bot pushed a commit that referenced this pull request Jun 10, 2021
Summary:
Pull Request resolved: #59722

Reintroduce sharing constant between bytecode and torchscript (same as #58629) after the fix #59642

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29002345

Pulled By: cccclai

fbshipit-source-id: d9c8e474ff57d0509580183206df038a24ad27e3
malfet pushed a commit that referenced this pull request Jun 11, 2021
Fixes a serialization problem caused by using the memory addresses of storages for mobile and torch.package models.

 - #59642 hold references to storages during TorchScript serialization

Uses StorageContext to hold a reference to every storage seen during TorchScript serialization, so that tensors can be created and destroyed while serialization is in progress. Tracking the storages solves the ABA memory problem.
@facebook-github-bot facebook-github-bot deleted the gh/Lilyjjo/85/head branch June 13, 2021 14:15