[Bug] PyTorch and TVM loading problem due to conflicting LLVM symbols #9362
It would be good to find out which symbol conflicts (perhaps by linking things together) and resolve it (rename the symbol on the TVM side if possible). Note that the same problem will appear again in the future if we really attempt a deeper PyTorch integration, so resolving it now would head off that issue.
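The symbol hunt suggested above can be sketched with standard binutils: dump the dynamic symbols each library defines and intersect the two lists. To keep this sketch self-contained and runnable, the two symbol lists are inlined; the library names in the comment are the hypothetical real inputs.

```shell
# Hedged sketch: find symbols defined by BOTH libraries. In real use,
# replace each printf pipeline with something like
#   nm -D --defined-only /path/to/libtvm.so | awk '{print $3}' | sort
#   nm -D --defined-only /path/to/libtorch_cpu.so | awk '{print $3}' | sort
# (paths are examples). Inline lists stand in for the real dumps here.
printf 'LLVMInitializeX86Target\nTVMFuncCall\n' | sort > tvm_syms.txt
printf 'LLVMInitializeX86Target\nat_native_add\n' | sort > torch_syms.txt
comm -12 tvm_syms.txt torch_syms.txt    # lines common to both sorted files
```

Any LLVM-prefixed C symbols showing up in the intersection would be prime suspects, since C symbols are not namespaced.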
To follow up a bit on this: we had a previous conflict with DGL which turned out to be DLPack related, and we moved away from it by prefixing those symbols with TVM. Turning on https://github.com/apache/tvm/blob/main/CMakeLists.txt#L46 would also help alleviate the issue, since the visible symbols would be reduced to those marked TVM_DLL. I would watch the C symbols carefully; most symbols are in the tvm namespace and should be fine.
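The option referenced there (HIDE_PRIVATE_SYMBOLS) works by compiling with -fvisibility=hidden. A self-contained sketch of the effect, assuming a Linux toolchain with cc and nm available (file and function names are made up for the demo):

```shell
# Hedged sketch: under -fvisibility=hidden, only symbols with an
# explicit default-visibility attribute (what TVM_DLL expands to)
# remain in the dynamic symbol table.
cat > vis_demo.c <<'EOF'
__attribute__((visibility("default"))) int exported_fn(void) { return 1; }
int hidden_fn(void) { return 2; }   /* dropped from the dynamic table */
EOF
cc -shared -fPIC -fvisibility=hidden vis_demo.c -o libvis_demo.so
nm -D --defined-only libvis_demo.so | grep fn   # only exported_fn appears
```

This is why turning the option on shrinks the set of symbols that can clash with PyTorch's copy of LLVM.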
I can confirm that
@masahi can you also confirm what the symbol is?
I built
Looks like I need to dig deep. I agree that we should fix this problem for deeper PT + TVM integration in the future.
Hmm strange, on the environment I tried
It would be great to try gdb and catch the backtrace; normally it gives some evidence of where things went wrong.
Here's the backtrace I receive from gdb:
When running:
Is this of any help?
With the trivial code, I get this useless backtrace.
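For reference, a backtrace like the ones discussed here can be captured non-interactively. A minimal sketch, assuming gdb is installed and guarded with a fallback so the command degrades gracefully in environments where the crash (or gdb itself) is absent:

```shell
# Hedged sketch: run the crashing import under gdb in batch mode and
# print a backtrace at the abort. The import line only reproduces this
# issue in an affected environment; the fallback keeps the pipeline safe.
gdb -batch -ex run -ex bt --args python3 -c 'import torch, tvm' 2>/dev/null \
  || echo "gdb run did not complete cleanly"
```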
OK, I dug into this a bit and I think I know the likely cause: a conflict of LLVM symbols (due to different versions of LLVM being used), since PyTorch has also started to ship with LLVM. To avoid the problem, we need to do two things: build TVM against a statically linked LLVM (USE_LLVM with --link-static) and turn HIDE_PRIVATE_SYMBOLS ON.
I did a quick experiment locally: when we turn both options ON, things are good; with either option off, there is a conflict.
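A sketch of how the two options look in config.cmake (the llvm-config path is an example and varies per system; as noted below, these go in config.cmake, not CMakeLists.txt):

```cmake
# In config.cmake. Point llvm-config at the LLVM you build TVM against;
# --link-static links LLVM statically into libtvm.
set(USE_LLVM "/usr/bin/llvm-config --link-static")
# Compile with -fvisibility=hidden so only TVM_DLL symbols are exported.
set(HIDE_PRIVATE_SYMBOLS ON)
```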
Thanks @tqchen, I confirmed that your solution worked on both of my environments too, with both static link and Also I realized that when I said "I cannot reproduce the original failure anymore" in #9362 (comment), my cmake config was pointing to a different, custom LLVM build that has only static libs. Moreover, apparently these custom libs were built in a way that So no mystery on my end anymore. I'm going to update the install doc to include this tip.
@tqchen I modified the CMakeLists.txt: tvm_option(USE_LLVM "/usr/bin/llvm-config --link-static" ON) and tvm_option(HIDE_PRIVATE_SYMBOLS "Compile with -fvisibility=hidden." ON). But I still hit the bug "free(): invalid pointer".
@Jie-KUN you need to set those configurations in config.cmake instead of CMakeLists.txt.
@tqchen, thank you sincerely. I still have a question: I tried the code "from_pytorch.py" from the tutorial, but I always see the message "One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details." Is that normal?
Yes, that's normal. Please post other questions to the discuss forum.
@masahi Ok, thank you!
* This is to work around an issue caused by conflicting LLVM versions, first observed since we updated PyTorch in TVM
* Discussion at: apache/tvm#9362
Thanks for letting us know. It seems that currently,
This test was originally disabled due to the issue documented in apache#7455 affecting CI. I believe this has since been resolved by apache#9362. Note: this patch should not be merged until the changes in https://github.com/tlc-pack/tlcpack/pull/81 are reflected in CI. Change-Id: Ib918595a1d9149e3c858ca761861304450cbfe13
As a follow-up to apache#9417, and now that apache#9362 is resolved, this PR adds a test to check that quantized PyTorch MobileNetV2 is converted correctly. Change-Id: Iaf2d38ce71c008e0141a4a2536bd54c2c9f3fe3d
This fixes: set HIDE_PRIVATE_SYMBOLS to ON to avoid the following error:
free(): invalid pointer
Aborted (core dumped)
Reference: apache/tvm#9362
Apparently, the new PyTorch release crashes with symbols loaded by TVM, so the following trivial code crashes upon exit with:

free(): invalid pointer
Aborted (core dumped)

We can work around this by swapping the import order, but as pointed out in #9349 (comment) this may not always be possible.

Another solution is to remove the use of RTLD_GLOBAL in tvm/python/tvm/_ffi/base.py (Line 57 in dfe4ceb).

See related issues in other repos that moved away from using RTLD_GLOBAL:
dmlc/dgl#2255
pytorch/pytorch#28536
pytorch/pytorch#3059

Is there any particular reason we are using RTLD_GLOBAL? @tqchen @areusch
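For context on why the dlopen mode matters, a minimal Python sketch of the RTLD_GLOBAL vs RTLD_LOCAL distinction; the libtvm load lines at the end are hypothetical and commented out, so only the flag comparison actually runs:

```python
import ctypes

# RTLD_GLOBAL makes every symbol of the loaded library visible to all
# subsequently loaded shared objects, so an LLVM copy linked into
# libtvm can collide with the LLVM copy inside libtorch. RTLD_LOCAL
# keeps the symbols private to the returned handle.
assert ctypes.RTLD_GLOBAL != ctypes.RTLD_LOCAL

# Hypothetical illustration only (library name is an example):
# lib = ctypes.CDLL("libtvm.so", ctypes.RTLD_GLOBAL)  # symbols exported globally
# lib = ctypes.CDLL("libtvm.so", ctypes.RTLD_LOCAL)   # symbols kept private
```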