Conversation

@zhyncs (Member) commented Aug 7, 2024:

Fix for the linker failure below when compiling from source. cc @yzh119

```
/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o: in function `_init':
    (.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
    /flashinfer/python/build/temp.linux-x86_64-cpython-310/csrc/batch_decode.o: in function `__cudaUnregisterBinaryUtil()':
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x1d7): relocation truncated to fit: R_X86_64_PC32 against `.bss'
    /flashinfer/python/build/temp.linux-x86_64-cpython-310/csrc/batch_decode.o: in function `std::string::_Rep::_M_dispose(std::allocator<char> const&) [clone .part.0]':
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x1e3): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `__libc_single_threaded@@GLIBC_2.32' defined in .bss section in /lib/x86_64-linux-gnu/libc.so.6
    /flashinfer/python/build/temp.linux-x86_64-cpython-310/csrc/batch_decode.o: in function `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) [clone .constprop.0]':
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x237): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `std::cerr@@GLIBCXX_3.4' defined in .bss section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x25b): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `std::cerr@@GLIBCXX_3.4' defined in .bss section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    /flashinfer/python/build/temp.linux-x86_64-cpython-310/csrc/batch_decode.o: in function `void* flashinfer::AlignedAllocator::aligned_alloc<void>(unsigned long, unsigned long, std::string) [clone .constprop.0]':
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x30a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x324): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_streambuf<char, std::char_traits<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x3cf): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_ios<char, std::char_traits<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x402): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `VTT for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x46e): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
    tmpxft_00000886_00000000-6_batch_decode.cudafe1.cpp:(.text+0x48c): additional relocation overflows omitted from the output
    build/lib.linux-x86_64-cpython-310/flashinfer/_kernels.cpython-310-x86_64-linux-gnu.so: PC-relative offset overflow in PLT entry for `PyDict_DelItemString'
    collect2: error: ld returned 1 exit status
    error: command '/usr/bin/x86_64-linux-gnu-g++' failed with exit code 1
    [end of output]
```

@zhyncs requested a review from @yzh119 on August 7, 2024 at 07:09.
Review thread on README.md (outdated):
```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer/python
# workaround for undefined symbol `__gmon_start__' on A100
```
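
The diff excerpt above cuts off before the build command itself. As a minimal sketch of how such a workaround typically looks, assuming it restricts compilation to a single CUDA architecture via PyTorch's standard `TORCH_CUDA_ARCH_LIST` variable (the exact line this PR adds is not visible in this excerpt):

```bash
# Hypothetical completion of the snippet above: compile only for sm_80 (A100)
# so the resulting binary stays small enough for the linker.
TORCH_CUDA_ARCH_LIST="8.0" pip install -e .
```
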
@yzh119 (Collaborator):

Can you provide a link to an existing issue (if any)?

@zhyncs (Member, Author):

I encountered this issue when compiling FlashInfer from source on an A100 machine with 8 devices (https://www.runpod.io). The image used was pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04, and I did not file a separate issue for it. The method in this PR is a workaround that I verified on RunPod.

@yzh119 (Collaborator):

Oh, I just noticed your error message. That error appears when the binary size is too large, which I should fix in the coming days. Limiting the target CUDA arch is one option to reduce binary size, and it doesn't apply only to A100; I suppose you might encounter the same issue on other GPU instances.

@zhyncs (Member, Author):

I found similar issues and questions on forums via Google, but their solutions didn't resolve my issue. Ref: https://www.google.com/search?q=R_X86_64_REX_GOTPCRELX%20against%20undefined%20symbol%20%60__gmon_start__%27

@zhyncs (Member, Author):

> I suppose you might encounter the same issue for other GPU instances.

Yes 😂

@yzh119 (Collaborator):

The fundamental solution is to break the CUDAExtension into multiple submodules and compile each of them into a shared object of reasonable size. cc @Yard1, as you might be interested.

@zhyncs (Member, Author):

Currently, the workflow running in the FlashInfer repository is functioning properly. For most users (who are not FlashInfer developers), using the wheel compiled by the workflow is sufficient. For developers like me who need to modify code and compile FlashInfer themselves, perhaps this is an acceptable workaround until the 'break the CUDAExtension into multiple submodules' work you mentioned is implemented.
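
For context, installing the prebuilt wheel rather than compiling from source looks roughly like this; the index URL and the CUDA/torch version tags here are assumptions, so check the FlashInfer installation docs for the current ones:

```bash
# Hypothetical example: install a prebuilt FlashInfer wheel instead of
# compiling from source (index URL and version tags assumed; see the docs).
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```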

@yzh119 (Collaborator):

So my suggestion here is to keep this note, but mention that it's for reducing binary size; don't say it's only for A100.
The following function in torch can help users identify their device capability:
https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html#torch.cuda.get_device_capability
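
For example, a quick way to look up the right value, assuming a working PyTorch installation (`get_device_capability` returns a `(major, minor)` tuple, e.g. `(8, 0)` on A100, which corresponds to the arch string `"8.0"`):

```bash
# Print the compute capability of the current GPU, e.g. (8, 0) on an A100,
# then pass it as TORCH_CUDA_ARCH_LIST="8.0" when building.
python -c "import torch; print(torch.cuda.get_device_capability())"
```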

@zhyncs (Member, Author):

ok

@zhyncs (Member, Author):

@yzh119 Could you check whether the new changes look alright? Thanks.

@yzh119 (Collaborator) left a review comment:

I'll merge this first and then organize an FAQ page for these issues. Thank you for your contribution.

@yzh119 merged commit ddc1f09 into flashinfer-ai:main on Aug 7, 2024.
@zhyncs deleted the doc branch on August 7, 2024 at 08:12.
@zhyncs added the documentation label on Aug 27, 2024.