Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic cuda wrapper #1065

Closed

Conversation

wkpark
Copy link
Contributor

@wkpark wkpark commented Feb 15, 2024

This is a proof-of-concept, quick and dirty, CUDA version independent, dymamic CUDA library wrapper fix.

  • cuda_setup/main.py compatible with older libraries. (older libraries mean libs with cuda version tags)
  • all used cublas*, cusparse*, cublasLt* functions are wrapped using dlsym()/GetProcessAddress()
  • minimize cuda runtime functions that are linked with cudart_static lib.

ubuntu-20.04 + cuda11.8 environment.

$ ldd venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda.so
        linux-vdso.so.1 (0x00007ffc44cc9000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f858d075000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f858d052000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f858d04c000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f858ce6a000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f858cd1b000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f858cd00000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f858cb0c000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f858f3fc000)

windows 10 + cuda11.8 environment (mingw64 terminal)

ldd bitsandbytes/libbitsandbytes_cuda.dll
       	ntdll.dll => /c/WINDOWS/SYSTEM32/ntdll.dll (0x7ffd23ed0000)
       	KERNEL32.DLL => /c/WINDOWS/System32/KERNEL32.DLL (0x7ffd23650000)
       	KERNELBASE.dll => /c/WINDOWS/System32/KERNELBASE.dll (0x7ffd217c0000)
       	msvcrt.dll => /c/WINDOWS/System32/msvcrt.dll (0x7ffd21f10000)
       	ucrtbase.dll => /c/WINDOWS/System32/ucrtbase.dll (0x7ffd21de0000)
       	VCRUNTIME140.dll => /c/WINDOWS/SYSTEM32/VCRUNTIME140.dll (0x7ffd0d890000)
       	MSVCP140.dll => /c/WINDOWS/SYSTEM32/MSVCP140.dll (0x7ffd0d8b0000)
       	ole32.dll => /c/WINDOWS/System32/ole32.dll (0x7ffd23710000)
       	RPCRT4.dll => /c/WINDOWS/System32/RPCRT4.dll (0x7ffd22730000)
       	combase.dll => /c/WINDOWS/System32/combase.dll (0x7ffd23a20000)
       	GDI32.dll => /c/WINDOWS/System32/GDI32.dll (0x7ffd22a20000)
       	win32u.dll => /c/WINDOWS/System32/win32u.dll (0x7ffd21ee0000)
       	gdi32full.dll => /c/WINDOWS/System32/gdi32full.dll (0x7ffd21c20000)
       	msvcp_win.dll => /c/WINDOWS/System32/msvcp_win.dll (0x7ffd21640000)
       	vcruntime140_1.dll => /c/Windows/System32/vcruntime140_1.dll (0x180000)
       	USER32.dll => /c/WINDOWS/System32/USER32.dll (0x7ffd23370000)
       	VCRUNTIME140_1.dll => /c/WINDOWS/SYSTEM32/VCRUNTIME140_1.dll (0x7ffd0d880000)
       	IMM32.DLL => /c/WINDOWS/System32/IMM32.DLL (0x7ffd22650000)

 * fix cmake version: default cmake version of ubuntu20.04 is too low
Copy link

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@akx
Copy link
Contributor

akx commented Feb 15, 2024

Ah, this will definitely need to be rebased after #898 and #1041 get merged.

Comment on lines +171 to +179
if(NOT USE_CUDA_WRAPPER)
target_link_libraries(bitsandbytes PUBLIC CUDA::cublas CUDA::cusparse)
endif()
if(NO_CUBLASLT)
target_compile_definitions(bitsandbytes PUBLIC NO_CUBLASLT)
else()
target_link_libraries(bitsandbytes PUBLIC CUDA::cublasLt)
if(NOT USE_CUDA_WRAPPER)
target_link_libraries(bitsandbytes PUBLIC CUDA::cublasLt)
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake has CMAKE_CUDA_RUNTIME_LIBRARY to control this; maybe would be better here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_RUNTIME_LIBRARY.html
docuement says CMAKE_CUDA_RUNTIME_LIBRARY only control cudart
(Currently, with or without this PR, cudart_static added by default)

@akx
Copy link
Contributor

akx commented Mar 4, 2024

I'm a little worried that the API or ABI could have changed between 11 and 12, causing very weird runtime errors if we just use symbols from v12 and call them like they were v11, or vice versa. 🤔

@Titus-von-Koeller
Copy link
Collaborator

Hey @wkpark @akx @matthewdouglas,

I'm hoping to merge this in the coming two weeks. One thing I feel we need to discuss is how we can test if this works as intended without introducing any issues:

Do you have any thoughts on that? I'm willing to do the work, but if you already have a few concrete things in mind, that would definitely help speed this up.

Thanks @wkpark on your initiative on this, really appreciated! Another thing needed would be to resolve the conflicts.

@Titus-von-Koeller
Copy link
Collaborator

P.S. What's your opinion on how this goes together with #1052? I don't fully grok that yet. From what I understand #1052 determines major CUDA version. Is #1052 dependent on this one here?

@matthewdouglas
Copy link
Member

matthewdouglas commented Apr 8, 2024

I'm not sure how I feel about this one yet. I will think about it a little more but my initial thought is that it seems to be quite hacky feeling. Maybe this is a common thing in C++ world, but I'm not used to seeing it.

Edit: This would definitely conflict with what I was thinking with #1126 too.

@akx
Copy link
Contributor

akx commented Apr 9, 2024

I also get a hacky/too-complex feeling here, but I'm not sure what the better option would be.

@Titus-von-Koeller
Copy link
Collaborator

Thanks for your valuable inputs ❤️ Really appreciated.

@TimDettmers
Copy link
Collaborator

Thank you so much for this thoughtful PR. We were thinking about binary size over time and this is an important issue together with support for many different CUDA minor versions. Your work on this is truly appreciated.

Overall, this would be very nice to lead to a smaller binary size and support all kind of different cuda minor versions we are currently not supporting with a direct pip install. Your approach is clever and well-implemented. However, currently we are very limited in providing support and guidance for debugging installation issues that come from users that might have problems with their CUDA setup. In particular, in the past, we had a more dynamic installation and it cause problems if a user had multiple CUDA installations.

As such, we do not see us in the position at the moment to be able to support or switch to a dynamic CUDA wrapper since it would add additional maintenance burden. This means, we will only support a few CUDA options at this point.

We will evaluate the way we ship binaries and support binaries every now and then and it might be that we will come back to your PR if we decide to go with dynamic CUDA wrapping. Your contribution has certainly given us valuable insights to consider for the future.

We genuinely appreciate the time and effort you've put into this PR, and we hope you'll continue to be an active part of our open-source community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants