[Bug] PyTorch and TVM loading problem due to conflicting LLVM symbols #9362

Closed
masahi opened this issue Oct 25, 2021 · 20 comments · Fixed by #9433

Comments

@masahi
Member

masahi commented Oct 25, 2021

Apparently, the new PyTorch release conflicts with symbols loaded by TVM, so the following trivial code crashes upon exit with "invalid pointer" and "Aborted (core dumped)":

import tvm
import torch

We can work around this by swapping the import order, but as pointed out in #9349 (comment), this may not always be possible.

Another solution is to remove the use of RTLD_GLOBAL in

lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)

See related issues in other repos that moved away from using RTLD_GLOBAL.
dmlc/dgl#2255
pytorch/pytorch#28536
pytorch/pytorch#3059

Is there any particular reason we are using RTLD_GLOBAL? @tqchen @areusch
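
For readers unfamiliar with the flag, here is a minimal sketch of the difference between the two loading modes in ctypes. The helper names `dlopen_mode` and `load_shared_library` are hypothetical and not part of TVM; only `ctypes.CDLL`, `ctypes.RTLD_GLOBAL`, and `ctypes.RTLD_LOCAL` are real APIs.

```python
import ctypes


def dlopen_mode(expose_symbols: bool) -> int:
    """Pick the dlopen flag (hypothetical helper, not a TVM API).

    RTLD_GLOBAL makes the library's symbols available for resolving
    references in libraries loaded afterwards; RTLD_LOCAL keeps them private.
    """
    return ctypes.RTLD_GLOBAL if expose_symbols else ctypes.RTLD_LOCAL


def load_shared_library(path, expose_symbols=False):
    """Load a shared library with the chosen symbol visibility."""
    return ctypes.CDLL(path, mode=dlopen_mode(expose_symbols))
```

With `expose_symbols=False`, libraries loaded later cannot resolve their undefined references against this one, which is why dropping RTLD_GLOBAL would avoid cross-library clashes but would also affect anything that relies on that lookup.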

@tqchen
Member

tqchen commented Oct 25, 2021

It would be good to find out which symbol conflicts (perhaps by linking things together) and resolve it (rename the symbol on the TVM side if possible). Note that the same problem will appear in the future if we really attempt a deeper integration that links PyTorch. This would serve as a way to resolve that possible issue.

RTLD_GLOBAL provides some convenience: it gives plugin modules (that are loaded later) the symbols of libtvm_runtime without explicitly linking to it. We might need to rethink the plugin mechanism (e.g. VTA) a bit if we decide to move away from it.

@tqchen
Member

tqchen commented Oct 25, 2021

To follow up a bit on this: we had a previous conflict with DGL, which ended up being DLPack related, and we resolved it by prefixing those symbols with TVM.

Turning on https://github.com/apache/tvm/blob/main/CMakeLists.txt#L46 would also help alleviate the issue, since the visible symbols will be reduced to only those marked with TVM_DLL.

I would watch the C symbols carefully, since most symbols are in the tvm namespace and should be fine.
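
One possible way to look for clashing names is to compare the dynamic symbol tables of the two libraries. The sketch below assumes GNU `nm` is available and that its output has the usual "address type name" layout; all function names here are hypothetical helpers, not TVM or PyTorch APIs.

```python
import subprocess


def parse_nm_output(text):
    """Collect symbol names from `nm -D --defined-only` style output."""
    symbols = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Drop any version suffix such as name@@VERSION.
        symbols.add(fields[-1].split("@")[0])
    return symbols


def exported_symbols(library_path):
    """Run nm on a shared library (assumes GNU binutils is installed)."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", library_path],
        check=True, capture_output=True, text=True,
    )
    return parse_nm_output(result.stdout)


def shared_symbols(first, second):
    """Names exported by both libraries, sorted for stable output."""
    return sorted(first & second)


def is_c_symbol(name):
    """Itanium-ABI mangled C++ names start with _Z; treat the rest as C."""
    return not name.startswith("_Z")
```

A typical use would be `shared_symbols(exported_symbols(path_to_libtvm), exported_symbols(path_to_libtorch_cpu))`, then filtering with `is_c_symbol` to see the unmangled names first. An overlap is only a hint: identical inline C++ symbols are often exported by several libraries without causing harm.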

@masahi
Member Author

masahi commented Oct 25, 2021

I can confirm that HIDE_PRIVATE_SYMBOLS=ON also fixes it. I think this is a good enough workaround for now cc @lhutton1 .

@masahi masahi closed this as completed Oct 25, 2021
@tqchen
Member

tqchen commented Oct 25, 2021

@masahi can you also confirm what is the symbol?

@masahi
Member Author

masahi commented Oct 26, 2021

I built libtvm.so linked against the PyTorch libs, and no error occurred.

$ ldd libtvm.so 
	linux-vdso.so.1 (0x00007ffcd07dd000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffb946a8000)
	libtorch_cpu.so => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so (0x00007ffb7ec58000)
	libc10.so => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libc10.so (0x00007ffb7ebd2000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffb7e9f0000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffb7e8a1000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffb7e886000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffb7e861000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb7e66f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb9644f000)
	libgomp-a34b3233.so.1 => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x00007ffb7e445000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ffb7e43a000)

Looks like I need to dig deeper. I agree that we should fix this problem for deeper PT + TVM integration in the future.

@masahi
Member Author

masahi commented Oct 26, 2021

Hmm, strange: in the environment where I tried HIDE_PRIVATE_SYMBOLS=ON above, I can no longer reproduce the original failure. And in the other environment, HIDE_PRIVATE_SYMBOLS=ON didn't fix the problem.

@masahi masahi reopened this Oct 26, 2021
@masahi masahi changed the title [Bug] Stop using RTLD_GLOBAL [Bug] Stop using RTLD_GLOBAL or fix symbol crash with PyTorch by other means Oct 27, 2021
@lhutton1
Contributor

set(HIDE_PRIVATE_SYMBOLS ON) didn't seem to work for me either :/

@tqchen
Member

tqchen commented Oct 28, 2021

It would be great to try gdb and catch the backtrace; normally it gives some evidence of where things went wrong.

@lhutton1
Contributor

Here's the backtrace I receive from gdb:

(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7a22921 in __GI_abort () at abort.c:79
#2  0x00007ffff7a6b967 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7b98b0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7a729da in malloc_printerr (str=str@entry=0x7ffff7b9a720 "munmap_chunk(): invalid pointer") at malloc.c:5342
#4  0x00007ffff7a79fbc in munmap_chunk (p=0x7fffffffbc18) at malloc.c:2846
#5  __GI___libc_free (mem=0x7fffffffbc28) at malloc.c:3127
#6  0x00007fff1dcafe86 in std::__detail::_Compiler<std::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#7  0x00007fff1debc1c0 in torch::jit::SourceImporterImpl::attributeAssignmentSpecialHandlingHack(c10::QualifiedName const&, torch::jit::Assign const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#8  0x00007fff1debed4a in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#9  0x00007fff1dec0313 in torch::jit::SourceImporterImpl::importNamedType(std::string const&, torch::jit::ClassDef const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff1dec08d1 in torch::jit::SourceImporterImpl::resolveType(std::string const&, torch::jit::SourceRange const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#11 0x00007fff1dc36668 in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so

When running:

import tvm
import torch
torch.jit.load(<path-to-any-model>)

Is this of any help?

@masahi
Member Author

masahi commented Oct 29, 2021

With the trivial code,

import tvm
import torch

I get this useless backtrace

free(): invalid pointer

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7db7859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e223ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f4c285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e2a47c in malloc_printerr (str=str@entry=0x7ffff7f4a4ae "free(): invalid pointer") at malloc.c:5347
#4  0x00007ffff7e2bcac in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#5  0x00007fffcfe92859 in ?? () from /lib/x86_64-linux-gnu/libLLVM-10.so.1
#6  0x00007ffff7ddba27 in __run_exit_handlers (status=0, listp=0x7ffff7f7d718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#7  0x00007ffff7ddbbe0 in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x00007ffff7db90ba in __libc_start_main (main=0x55555566d460 <main>, argc=2, argv=0x7fffffffd4d8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffd4c8) at ../csu/libc-start.c:342
#9  0x000055555573afe5 in _start () at ../sysdeps/x86_64/elf/start.S:103

@tqchen
Member

tqchen commented Oct 30, 2021

OK, I dug into this a bit, and I think I know the possible cause: a conflict of LLVM symbols (due to different versions of LLVM being used). PyTorch has also started to ship with LLVM. To avoid the problem, we need to do two things:

  • Turn on static linking of LLVM. This links the LLVM code directly into libtvm without relying on a dynamic library (which creates global symbols).
    • set(USE_LLVM "/path/to/llvm-config --link-static")
  • Turn on set(HIDE_PRIVATE_SYMBOLS ON). This effectively hides the LLVM-related symbols when we load globally from PyTorch.

I did a quick experiment locally: with both options turned ON, things are good, and with either option off, the conflict remains.
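
To repeat this kind of experiment without crashing the interpreter you are working in, the import sequence can be run in a child process and its exit status inspected. This is only a sketch: the helper names are hypothetical, and it assumes a POSIX system, where a child killed by a signal is reported with a negative return code.

```python
import signal
import subprocess
import sys

# The two import orders discussed in this thread.
IMPORT_ORDERS = {
    "tvm first": "import tvm\nimport torch",
    "torch first": "import torch\nimport tvm",
}


def run_snippet(code, timeout=300):
    """Run code in a fresh Python process and return its exit status."""
    completed = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return completed.returncode


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative value is the number of the killing signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def check_import_orders():
    """Try each import order in isolation and report how the process ended."""
    return {label: describe_exit(run_snippet(code))
            for label, code in IMPORT_ORDERS.items()}
```

The "free(): invalid pointer" abort from this issue would show up as "killed by SIGABRT" for the affected import order, while a plain "exit code 1" more likely means one of the packages is not installed.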

@masahi
Member Author

masahi commented Oct 30, 2021

Thanks @tqchen, I confirmed that your solution works in both of my environments too, and that both static linking and HIDE_PRIVATE_SYMBOLS are required.

Also, I realized that when I said "I cannot reproduce the original failure anymore" in #9362 (comment), my cmake config was pointing to a different, custom LLVM build that has only static libs. Moreover, apparently these custom libs were built in a way that doesn't require HIDE_PRIVATE_SYMBOLS to be enabled.

So no mystery on my end anymore.

I'm going to update the install doc to include this tip.
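
As a rough check of whether a given libtvm.so still pulls in LLVM dynamically, the ldd output shown earlier in this thread can be scanned for a libLLVM entry. The parser below is a sketch with hypothetical helper names; it assumes the usual "name => path (address)" layout of ldd output.

```python
def parse_ldd(output):
    """Map each dependency name in ldd output to its resolved path (or None)."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            name, _, rest = line.partition("=>")
            # Strip the trailing load address, e.g. " (0x00007ffb946a8000)".
            path = rest.strip().split(" (")[0].strip()
            deps[name.strip()] = path or None
        else:
            # Entries such as linux-vdso.so.1 have no "=> path" part.
            deps[line.split(" (")[0].strip()] = None
    return deps


def dynamic_llvm_dependencies(output):
    """Names of dynamically linked LLVM libraries, if any."""
    return sorted(name for name in parse_ldd(output)
                  if name.startswith("libLLVM"))
```

An empty result suggests LLVM was linked statically (or not at all). It does not by itself show that the LLVM symbols are hidden, which is the job of HIDE_PRIVATE_SYMBOLS.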

@Jie-KUN

Jie-KUN commented Oct 31, 2021

@tqchen I modified CMakeLists.txt:

tvm_option(USE_LLVM "/usr/bin/llvm-config --link-static" ON)

tvm_option(HIDE_PRIVATE_SYMBOLS "Compile with -fvisibility=hidden." ON)

But I still hit the "free(): invalid pointer" bug.

@tqchen
Member

tqchen commented Oct 31, 2021

@Jie-KUN you need to set those configurations in config.cmake instead of CMakeLists.txt.
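
For anyone scripting their build, here is a small sketch that rewrites the two settings in the text of a config.cmake file. The function name is hypothetical, and the llvm-config path is the same placeholder used earlier in this thread, so substitute your own.

```python
import re


def set_cmake_option(config_text, name, value):
    """Return config text with `set(name value)` replaced or appended.

    Only uncommented, single-line `set(NAME ...)` entries are rewritten.
    """
    pattern = re.compile(
        rf"^[ \t]*set\([ \t]*{re.escape(name)}[ \t]+[^\n]*\)[ \t]*$",
        re.MULTILINE,
    )
    new_line = f"set({name} {value})"
    if pattern.search(config_text):
        # A callable replacement avoids backslash escapes in the value.
        return pattern.sub(lambda match: new_line, config_text)
    separator = "" if not config_text or config_text.endswith("\n") else "\n"
    return config_text + separator + new_line + "\n"


def apply_llvm_workaround(config_text, llvm_config="/path/to/llvm-config"):
    """Apply both settings from this thread (llvm_config is a placeholder)."""
    text = set_cmake_option(
        config_text, "USE_LLVM", f'"{llvm_config} --link-static"')
    return set_cmake_option(text, "HIDE_PRIVATE_SYMBOLS", "ON")
```

The result still has to be written back to config.cmake in the build directory, followed by re-running cmake and rebuilding, since both settings change how libtvm is linked.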

@tqchen tqchen changed the title [Bug] Stop using RTLD_GLOBAL or fix symbol crash with PyTorch by other means [Bug] PyTorch and TVM loading problem due to conflicting LLVM symbols Oct 31, 2021
@Jie-KUN

Jie-KUN commented Nov 1, 2021

@tqchen, thank you sincerely. I have another question: I tried the "from_pytorch.py" code from the tutorial, but I always get this message:

"One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details."

Is that normal?

@masahi
Member Author

masahi commented Nov 1, 2021

Yes that's normal. Please post other questions to the discuss forum.

@Jie-KUN

Jie-KUN commented Nov 1, 2021

@masahi Ok, thank you!

@tqchen
Member

tqchen commented Nov 1, 2021

cc @leandron @areusch for awareness; let us update the tlcpack config.

leandron added a commit to leandron/tlcpack that referenced this issue Nov 1, 2021
* This is to workaround an issue caused by conflicting LLVM
  versions, first observed by since we updated Pytorch in TVM

* Discussion at: apache/tvm#9362
@leandron
Contributor

leandron commented Nov 1, 2021

* Turn on static linking of LLVM, this will directly link llvm code into libtvm without relying on dynamic library (that creates global symbols)
  * `set(USE_LLVM "/path/to/llvm-config --link-static")`
* Turn on `set(HIDE_PRIVATE_SYMBOLS ON)`. This will effectively hide the LLVM related symbols when we load globally from pytorch.

Thanks for letting us know. It seems that currently, --link-static is already there in tlcpack. I added tlc-pack/tlcpack#81 for the workaround discussed here.

tqchen pushed a commit to tlc-pack/tlcpack that referenced this issue Nov 2, 2021
* This is to workaround an issue caused by conflicting LLVM
  versions, first observed by since we updated Pytorch in TVM

* Discussion at: apache/tvm#9362
lhutton1 added a commit to lhutton1/tvm that referenced this issue Nov 4, 2021
This test was originally disabled due to the issue documented in apache#7455
affecting CI. I believe this has since been resolved by apache#9362.

Note: This patch should not be merged until the changes in
https://github.com/tlc-pack/tlcpack/pull/81 are reflected in CI.

Change-Id: Ib918595a1d9149e3c858ca761861304450cbfe13
masahi pushed a commit that referenced this issue Nov 6, 2021
This test was originally disabled due to the issue documented in #7455
affecting CI. I believe this has since been resolved by #9362.

Note: This patch should not be merged until the changes in
https://github.com/tlc-pack/tlcpack/pull/81 are reflected in CI.

Change-Id: Ib918595a1d9149e3c858ca761861304450cbfe13
lhutton1 added a commit to lhutton1/tvm that referenced this issue Nov 8, 2021
As a follow up to apache#9417 and now that apache#9362 is resolved, this PR adds a
test to check quantized pytorch mobilenetv2 is converted correctly.

Change-Id: Iaf2d38ce71c008e0141a4a2536bd54c2c9f3fe3d
leandron pushed a commit that referenced this issue Nov 9, 2021
As a follow up to #9417 and now that #9362 is resolved, this PR adds a
test to check quantized pytorch mobilenetv2 is converted correctly.

Change-Id: Iaf2d38ce71c008e0141a4a2536bd54c2c9f3fe3d
cgerum added a commit to ekut-es/hannah-tvm that referenced this issue Sep 27, 2022
This fixes:

Set hide private symbols to on to avoid the following error:
 free(): invalid pointer
 Aborted (core dumped)

Reference: apache/tvm#9362