
Conversation


@lobstergrindset lobstergrindset commented Jun 8, 2021

Stack from ghstack:

Uses StorageContext to hold a reference to every storage seen during TorchScript serialization, so that tensors can be created and destroyed while serialization is in progress. Tracking the storages solves the ABA memory problem (a freed StorageImpl's address being reused for a different storage, which would give two distinct tensors the same serialized name).
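As a rough sketch of the idea (hypothetical class and member names; the real change lives in the TorchScript serializer, see the diff hunks below), the context simply keeps a strong reference to every storage it has recorded, so a StorageImpl cannot be freed and its address reused for a different tensor while serialization is still in flight:

```
#include <string>
#include <unordered_map>
#include <c10/core/Storage.h>

// Hypothetical sketch of the StorageContext idea: the context owns a
// reference to every storage recorded during serialization, so the
// underlying StorageImpl cannot be destroyed (and its address reused for
// a new storage) until serialization is finished.
class StorageContextSketch {
 public:
  // Record a storage under its string id; copying the c10::Storage bumps
  // the refcount of the StorageImpl it wraps.
  void addStorage(const std::string& id, c10::Storage storage) {
    storages_.emplace(id, std::move(storage));
  }

  bool hasStorage(const std::string& id) const {
    return storages_.count(id) != 0;
  }

 private:
  std::unordered_map<std::string, c10::Storage> storages_;
};
```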

Differential Revision: D28968947

@facebook-github-bot facebook-github-bot added the cla signed and oncall: jit (Add this issue/PR to JIT oncall triage queue) labels Jun 8, 2021
Contributor

facebook-github-bot commented Jun 8, 2021

💊 CI failures summary and remediations

As of commit 7f61ba2 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 08 23:53:59 test_udf_remote_message_delay...yUniqueId(created_on=0, local_id=0) to be created.
Jun 08 23:53:15 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7febcd6491d3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:15 frame #14: <unknown function> + 0xc8421 (0x7febc20fd421 in /opt/conda/lib/libstdc++.so.6)
Jun 08 23:53:15 frame #15: <unknown function> + 0x76ba (0x7febdcad56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 08 23:53:15 frame #16: clone + 0x6d (0x7febdc80b51d in /lib/x86_64-linux-gnu/libc.so.6)
Jun 08 23:53:15 
Jun 08 23:53:15 ok (4.457s)
Jun 08 23:53:31   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.993s)
Jun 08 23:53:41   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (10.082s)
Jun 08 23:53:46   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.456s)
Jun 08 23:53:54   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.568s)
Jun 08 23:53:59   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 08 23:53:59 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 08 23:53:59 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f2a64241499 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f2a6423d5d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f2a6423ef2e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 08 23:53:59 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f2a5cde2504 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7f2a5cdd24a1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7f2a6559fcf8 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 08 23:53:59 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7f2a5cdd6fd4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 08 23:53:59 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7f2a6559f2f5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 08 23:53:59 frame #8: <unknown function> + 0x405cb6a (0x7f2a5cdd3b6a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

1 job timed out:

  • pytorch_linux_xenial_py3_6_gcc5_4_test


lobstergrindset pushed a commit that referenced this pull request Jun 8, 2021
Author

@Lilyjjo has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@lobstergrindset lobstergrindset requested review from raziel and suo June 8, 2021 18:29
Contributor

cccclai commented Jun 8, 2021

Can this PR be cherry-picked? Otherwise, the heavy save/load process could be flaky.

".storage");
tensor.storage().unsafeGetStorageImpl()));
tensor_names.push_back(string_id + ".storage");
storage_context_.addStorage(string_id, tensor.storage());
Contributor

I feel it's a great idea to use StorageContext. The storage_context_ lives for the life span of the serializer, so it should be safe for any tensor serialization. @cccclai could you test whether it works for bytecode v5 serialization with quantized models? It looks different from #59488, which you've already tested.
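To make the lifetime argument concrete, here is a minimal sketch (hypothetical names, reusing the StorageContextSketch above; the real code is in the diff hunk under review): the context is a member of the serializer, so every storage it records stays referenced until the entire serialization pass has finished.

```
#include <cstdint>
#include <string>
#include <ATen/ATen.h>

// Hypothetical sketch of how the serializer could name and retain a
// tensor's storage: the name is derived from the StorageImpl address,
// and the storage is registered in the context, whose lifetime matches
// the serializer's.
class SerializerSketch {
 public:
  std::string recordStorage(const at::Tensor& tensor) {
    const std::string string_id = std::to_string(reinterpret_cast<std::intptr_t>(
        tensor.storage().unsafeGetStorageImpl()));
    // Keep the storage alive for the rest of serialization.
    storage_context_.addStorage(string_id, tensor.storage());
    return string_id + ".storage";
  }

 private:
  StorageContextSketch storage_context_;  // destroyed only with the serializer
};
```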

Contributor

@cccclai cccclai Jun 8, 2021

Tested it by re-introducing #58629 and using the following code:

```
import torch
from torch.jit.mobile import _load_for_lite_interpreter

# model_path / model_resave_path: paths to an existing .ptl model and its re-save target
for i in range(20):
    print(i)
    m = torch.jit.load(model_path)
    m._save_for_lite_interpreter(model_resave_path)
    mm = _load_for_lite_interpreter(model_resave_path)
```

```
std::to_string(reinterpret_cast<std::intptr_t>(
tensor.storage().unsafeGetStorageImpl())) +
".storage");
tensor.storage().unsafeGetStorageImpl()));
```
Contributor

I guess the only caveat of using StorageImpl is that the id may change from run to run. It should not be a real issue, but for metadata comparison on data.pkl or bytecode.pkl, the tensor metadata may differ textually even though the tensor content does not change.

Author

Yes, the loss of determinism in the naming is not ideal; others have asked for this to be fixed as well. I can address it in a follow-up PR.
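For illustration only, one way a follow-up could make the names deterministic (a hypothetical sketch, not what this PR does): let the context hand out sequential ids in first-seen order instead of deriving the name from the StorageImpl address.

```
#include <cstdint>
#include <unordered_map>
#include <vector>
#include <c10/core/Storage.h>

// Hypothetical sketch: deterministic storage naming. Ids are assigned in
// first-seen order, so two runs over the same model produce the same names.
class DeterministicStorageContextSketch {
 public:
  // Returns the id for this storage, assigning the next sequential id the
  // first time the storage is seen. Holding the c10::Storage keeps the
  // StorageImpl alive, so the pointer key stays valid and unique.
  uint64_t getOrAddStorage(const c10::Storage& storage) {
    const void* key = storage.unsafeGetStorageImpl();
    auto it = ids_.find(key);
    if (it != ids_.end()) {
      return it->second;
    }
    uint64_t id = storages_.size();
    storages_.push_back(storage);
    ids_.emplace(key, id);
    return id;
  }

 private:
  std::vector<c10::Storage> storages_;             // keeps storages alive
  std::unordered_map<const void*, uint64_t> ids_;  // StorageImpl* -> id
};
```

Because the ids depend only on traversal order, re-serializing the same model would yield identical names in data.pkl and bytecode.pkl.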

Contributor

@cccclai cccclai left a comment

LGTM from mobile side.

Author

lobstergrindset commented Jun 8, 2021

Do we want to try to get this PR into the 1.9 release? I saw that people were still adding to the branch earlier today (#58518).

It would fix the serialization issues that I think are currently present in the release.

Contributor

cccclai commented Jun 8, 2021

> Do we want to try to get this PR into the 1.9 release? I saw that people were still adding to the branch earlier today (#58518). It would fix the serialization issues that I think are currently present in the release.

Let's try. Can you comment under the tracker? I will comment there as well.

Contributor

@iseeyuan iseeyuan left a comment

LGTM. Thanks!

Contributor

@Lilyjjo merged this pull request in 3271853.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
…#59642)

Summary: Pull Request resolved: pytorch#59642

Test Plan: Imported from OSS

Reviewed By: jbschlosser, cccclai

Differential Revision: D28968947

Pulled By: Lilyjjo

fbshipit-source-id: 0046da8adb3a29fb108965a1d2201749fe2d0b41
cccclai added a commit that referenced this pull request Jun 9, 2021
Reintroduce sharing constant between bytecode and torchscript (same as #58629) after the fix #59642 

Test it by:
```
import torch
from torch.jit.mobile import _load_for_lite_interpreter

model_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GKzegAqmfRoYTygDACAlZhsBGFcRbmQwAAAA.ptl"
# model_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GICWmAARE1pJ3yQEAFk58lqFfzVGbgdIAAAi.ptl"
model_resave_path = "/Users/chenlai/Documents/pytorch/reuse_constant/tmp/other_models/model_GKzegAqmfRoYTygDACAlZhsBGFcRbmQwAAAA_resave.ptl"

for i in range(20):
    print(i)
    m = torch.jit.load(model_path)
    m._save_for_lite_interpreter(model_resave_path)
    mm = _load_for_lite_interpreter(model_resave_path)
```


Differential Revision: [D29002345](https://our.internmc.facebook.com/intern/diff/D29002345)

[ghstack-poisoned]
cccclai added five more commits referencing this pull request on Jun 10, 2021, each with the same message as the commit above.
facebook-github-bot pushed a commit that referenced this pull request Jun 10, 2021
Summary:
Pull Request resolved: #59722

Reintroduce sharing constant between bytecode and torchscript (same as #58629) after the fix #59642

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29002345

Pulled By: cccclai

fbshipit-source-id: d9c8e474ff57d0509580183206df038a24ad27e3
malfet pushed a commit that referenced this pull request Jun 11, 2021
Fixes a serialization problem caused by using the memory addresses of storages for mobile and torch.package models.

 - #59642 hold references to storages during TorchScript serialization

Uses StorageContext to hold a reference to every storage seen during TorchScript serialization, so that tensors can be created and destroyed while serialization is in progress. Tracking the storages solves the ABA memory problem.
@facebook-github-bot facebook-github-bot deleted the gh/Lilyjjo/85/head branch June 13, 2021 14:15