docs: update README #426
Conversation
README.md (Outdated)
```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer/python
# workaround for undefined symbol `__gmon_start__' on A100
```
Can you provide a link to an existing issue (if any)?
I encountered this issue when compiling FlashInfer from source on an A100 machine with 8 devices (https://www.runpod.io). The image used was pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04, and I did not raise a separate issue for it. The method in this PR is a workaround that I verified on RunPod.
Oh, I just noticed your error message. That error appears when the binary size is too large, which I should fix soon. Limiting the target CUDA archs is one way to reduce binary size, and it doesn't apply only to A100; I suppose you might encounter the same issue on other GPU instances.
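For reference, a minimal sketch of limiting the target arch before building; the value `8.0` (A100) is an assumption, and you can equally `export` the variable in the shell:

```python
import os

# Sketch: restrict the build to one compute capability (8.0 = A100) to
# reduce binary size. torch.utils.cpp_extension reads this environment
# variable at build time; a single arch keeps the compiled fatbin small
# at the cost of portability to other GPUs.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
```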
I found similar issues and questions on forums through Google, but their solutions didn't resolve my issue. Ref: https://www.google.com/search?q=R_X86_64_REX_GOTPCRELX%20against%20undefined%20symbol%20%60__gmon_start__%27
> I suppose you might encounter the same issue for other GPU instances.

Yes 😂
The fundamental solution is to break the CUDAExtension into multiple submodules and compile each of them into a shared object of reasonable size. cc @Yard1, as you might be interested.
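A hypothetical sketch of what that split could look like; the submodule and source names below are illustrative assumptions, not FlashInfer's actual layout:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Hypothetical: one CUDAExtension per kernel family, so each compiles
# into its own reasonably sized shared object instead of one huge binary.
ext_modules = [
    CUDAExtension(name="flashinfer._prefill", sources=["csrc/prefill.cu"]),
    CUDAExtension(name="flashinfer._decode", sources=["csrc/decode.cu"]),
]

setup(
    name="flashinfer",
    ext_modules=ext_modules,
    cmdclass={"build_ext": BuildExtension},
)
```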
Currently, the workflow running in the FlashInfer repository is functioning properly. For most users (not FlashInfer developers), using the wheel compiled in the workflow is sufficient. For a developer who needs to modify code and compile within FlashInfer, like me, this is perhaps an acceptable workaround until the CUDAExtension is broken into multiple submodules as you mentioned.
So my suggestion here is to keep this note, but mention that it's for reducing binary size; don't say it's only for A100.
The following function in torch can help users identify their device capability:
https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html#torch.cuda.get_device_capability
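For example, a short sketch that turns the reported capability into an arch string suitable for TORCH_CUDA_ARCH_LIST:

```python
import torch

# Query the local GPU's compute capability: (8, 0) on A100, (9, 0) on H100.
major, minor = torch.cuda.get_device_capability()
print(f"TORCH_CUDA_ARCH_LIST={major}.{minor}")
```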
ok
@yzh119 Could you help check whether the new changes are alright? Thanks.
yzh119 left a comment
I'll merge this first and then organize an FAQ page for these issues. Thank you for your contribution.
fix cc @yzh119