Build issues with local CUDA installation #23689
Comments
Hi @adamjstewart, the GCC compiler is not officially supported by JAX. I recommend using Clang. You can pass the Clang path in …
If you absolutely need to use GCC, we have experimental support that can be enabled like this: …
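The actual flags recommended here are omitted from this extract. Purely as a rough illustration of the general shape, here is a sketch that mirrors the `CC`/`BAZEL_COMPILER` repo environment variables used for Clang later in this thread, pointed at GCC instead; these are not the officially documented instructions.

```
# Hypothetical sketch only: repo_env names copied from the Clang setup shown
# later in this thread, redirected to a GCC binary.
python3 build/build.py --enable_cuda \
  --bazel_options=--repo_env=CC=/usr/bin/gcc \
  --bazel_options=--repo_env=BAZEL_COMPILER=/usr/bin/gcc
```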
I tried adding these flags but I still see the exact same error:
Would you paste the full stack trace here, please? I'd like to make sure that …
Here you go: …
Hmm, one more suggestion: try this … The reason your build fails is that GCC is unable to compile the CUDA dependencies; they should be compiled with the NVCC compiler.
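For reference, the switch to NVCC appears later in this thread as Bazel configs (`--config=nvcc_clang` and `--config=cuda_nvcc` both show up). A minimal sketch, assuming the options are appended to `.bazelrc.user`:

```
# Sketch based on configs used later in this thread: compile CUDA sources with
# NVCC while keeping Clang/GCC as the host compiler.
cat >> .bazelrc.user <<'EOF'
build --config=cuda
build --config=cuda_nvcc
EOF
```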
Still the same issue:
This is what I've tried:
The subcommand I got:
I didn't get the … I assume that something in the environment variables on your machine is messing up the subcommand configuration.
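One generic way to compare the compiler subcommands Bazel actually issues on the two machines; `--subcommands` is a standard Bazel flag, and the wrapper path and target follow the commands used elsewhere in this thread:

```
# Print every subcommand Bazel spawns, so the compiler invocation and its
# environment-derived flags can be compared across machines.
build/bazel-6.5.0-linux-x86_64 build --subcommands --verbose_failures \
  //jaxlib/tools:build_wheel
```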
There is the … Alternatively, I tried to build it with Clang and local CUDA, CUDNN, and NCCL, but other issues occur.
Specifically, I run:

build/bazel-6.5.0-linux-x86_64 run --verbose_failures=true \
    --repo_env=LOCAL_CUDA_PATH=/opt/cuda \
    --repo_env=LOCAL_CUDNN_PATH=/usr \
    --repo_env=LOCAL_NCCL_PATH=/usr \
    //jaxlib/tools:build_wheel -- \
    --output_path=$PWD/dist --cpu=x86_64 \
    --jaxlib_git_hash=78ade74d695407306461718a6d73cfed89b4d972

Also, I add the following `.bazelrc.user`:

# .bazelrc.user
build --strategy=Genrule=standalone
build --action_env CLANG_COMPILER_PATH="/usr/bin/clang-18"
build --repo_env CC="/usr/bin/clang-18"
build --repo_env BAZEL_COMPILER="/usr/bin/clang-18"
build --copt=-Wno-error=unused-command-line-argument
build --copt=-Wno-gnu-offsetof-extensions
build --config=avx_posix
build --config=mkl_open_source_only
build --config=cuda
build --config=nvcc_clang
build --action_env=CLANG_CUDA_COMPILER_PATH=/usr/bin/clang-18
build --repo_env HERMETIC_PYTHON_VERSION="3.12"

Dependency versions follow:

$ pacman -Qs '(cuda|cudnn|clang)'
local/clang 18.1.8-4
    C language family frontend for LLVM
local/compiler-rt 18.1.8-1
    Compiler runtime libraries for clang
local/cuda 12.6.2-2
    NVIDIA's GPU programming toolkit
local/cudnn 9.2.1.18-1
    NVIDIA CUDA Deep Neural Network library
This looks like a problem with the GCC installation. Looking at the error above, I suggest running this command: …
I reproduce the issue for:

$ /usr/lib/llvm14/bin/clang-14 -v
clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm14/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/13.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

Also I have appended …

$ (cd ... && .../crosstool_wrapper_driver_is_not_gcc ... -v)
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1
...
/usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1
/usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/x86_64-pc-linux-gnu
/usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/backward
/usr/lib/llvm14/lib/clang/14.0.6/include
/usr/local/include
/usr/include
End of search list.
external/xla/xla/tsl/cuda/cupti_stub.cc:16:10: fatal error: 'third_party/gpus/cuda/extras/CUPTI/include/cupti.h' file not found
#include "third_party/gpus/cuda/extras/CUPTI/include/cupti.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

$ ls -l /usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/stdlib.h
-rw-r--r-- 1 root root 2.3K Sep 10 13:07 /usr/lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/stdlib.h

But it is a bit odd that now …
There is indeed no such directory:

find -L bazel-jax-jax-v0.4.34 -name 'CUPTI'

UPD: Is it an upstream issue (XLA)?
Would you check whether your local CUDA installation has the CUPTI headers, please? Specifically, the following headers should be present: …
Also, please check that the structure of the local CUDA/CUDNN/NCCL dirs is exactly the same as described here.
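As an illustration of that check, assuming the local CUDA tree at `/opt/cuda` used elsewhere in this thread:

```
# Verify the CUPTI headers exist in the local CUDA installation.
ls -l /opt/cuda/extras/CUPTI/include/cupti.h
ls /opt/cuda/extras/CUPTI/include | head
```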
Sure. I checked, and CUPTI is where it should be (i.e. …).
However, all these directories are empty. I compared how these directories look for another CUDA library (e.g. …):

mkdir -p bazel-out/k8-opt/bin/external/cuda_cupti/_virtual_includes/headers/third_party/gpus/cuda/extras/CUPTI
ln -s /opt/cuda/extras/CUPTI/include \
    bazel-out/k8-opt/bin/external/cuda_cupti/_virtual_includes/headers/third_party/gpus/cuda/extras/CUPTI
ln -s /opt/cuda/extras/CUPTI/include \
    bazel-out/k8-opt/bin/external/cuda_cupti/include

and run …
It seems that … Is this trailing slash important? Other …
UPD: Manual editing of …
The issue is that … This is how the CUDA folder should look: …
So all headers should be located in …
Also, please note that a local CUDA installation is not the recommended approach for building from source.
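A sketch of the kind of layout being described, as I read the linked guidance; the directory names here are my assumption, so check them against the linked document:

```
# Assumed layout for LOCAL_CUDA_PATH (verify against the linked docs):
#   /opt/cuda/
#     include/   # all CUDA headers merged here, including CUPTI's cupti.h
#     lib/       # shared libraries (note: lib/, not lib64/)
#     bin/
ls /opt/cuda/include /opt/cuda/lib
```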
I have already tried it. I copied everything from …
As far as I understand, you use the command below:
Would you confirm that all CUDA headers are located in …
Absolutely.
Link. Since target …:

build/bazel-6.5.0-linux-x86_64 build --verbose_failures=true \
    --repo_env=LOCAL_CUDA_PATH=/opt/cuda \
    --repo_env=LOCAL_CUDNN_PATH=/usr \
    --repo_env=LOCAL_NCCL_PATH=/usr \
    @xla//xla/tsl/cuda:cupti_stub
Can you check this folder please? Please don't build …
Yes, it has …
Target …:

$ build/bazel-6.5.0-linux-x86_64 query \
    --repo_env=LOCAL_CUDA_PATH=/opt/cuda \
    --repo_env=LOCAL_CUDNN_PATH=/usr \
    --repo_env=LOCAL_NCCL_PATH=/usr \
    "deps(kind(rule, deps(//jaxlib/tools:build_wheel)))" | grep cupti
...
@xla//xla/tsl/cuda:cupti_stub
...
I put all auxiliary options into …:

build --strategy=Genrule=standalone
build --action_env CLANG_COMPILER_PATH="/usr/lib/llvm14/bin/clang-14"
build --repo_env CC="/usr/lib/llvm14/bin/clang-14"
build --repo_env BAZEL_COMPILER="/usr/lib/llvm14/bin/clang-14"
build --copt=-Wno-error=unused-command-line-argument
build --copt=-Wno-gnu-offsetof-extensions
build --config=avx_posix
build --config=mkl_open_source_only
build --config=cuda
build --config=cuda_nvcc
build --action_env=CLANG_CUDA_COMPILER_PATH="/usr/lib/llvm14/bin/clang-14"
build --repo_env HERMETIC_PYTHON_VERSION="3.12"
The headers are commented out in two cases:
Here are my results:
It seems that the missing header error is caused by …
And
No idea how to easily fix the issue. Adding …
When I run the command on a pretty much default Arch Linux box, I don't have a problem with locating the stdlib header, but I have another problem related to it:
Can you try this command please?
Please note that …
We are planning to update the instructions on how to build JAX from source.
Then I have the same problem as daskol:
if adding …
I get the message about the missing "third_party/gpus/cuda/extras/CUPTI/include/cupti.h". I've also tried several hacks, like …
and rebuilding after cleaning the Bazel cache, but I still get one or the other of the two error messages above.
May I ask you to describe your use case? Why is it necessary to use the local CUDA path in your scenario?
…ath. `cc.endswith("clang")` didn't work for the cases when the clang compiler path is like `/usr/bin/clang-18`. This change addresses [Github issue](jax-ml/jax#23689). PiperOrigin-RevId: 693735256
…ath. `cc.endswith("clang")` didn't work for the cases when the clang compiler path is like `/usr/bin/clang-18`. This change addresses [Github issue](jax-ml/jax#23689). PiperOrigin-RevId: 694536448
openxla/xla#19113 is merged now.
Tried upgrading to jaxlib 0.4.37, but now I have a different error even earlier in the build: #25488
Okay, on jaxlib 0.4.38 with the patch from #25531 applied, the GCC x86_64 CPU-only build works, but the GCC x86_64 CUDA build still fails. The error message is now different:
I'm not sure why it's hard-coded to search for …
Hi @adamjstewart, would you check if you have the symlink …? It should point to …
Also, the content of …
The reason why it searches for … In the log I can see that …
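The specific symlink being asked about is omitted above; as a generic illustration of how to verify one (the paths are placeholders, not from this thread):

```
# Placeholder paths: show what a symlink is and where it ultimately resolves.
ls -ld "$LOCAL_CUDA_PATH/lib"
readlink -f "$LOCAL_CUDA_PATH/lib"
```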
This is in CI, so I can't easily reproduce it in that specific environment; let me try building locally and get back to you.
Okay, I was able to reproduce the build failure locally. Here is the new build log for posterity. Yes, the … However, the …:

cc_library(
    name = "headers",
    #hdrs = glob([
    #...
    #"include/cuda.h",
    #...
    #]),
    include_prefix = "third_party/gpus/cuda/include",
    includes = ["include"],
    strip_include_prefix = "include",
    visibility = ["@local_config_cuda//cuda:__pkg__"],
)

I'm guessing this is the source of the issue. Any idea why this would be commented out?
The lines can be commented out when the repository rule can't find the … In particular, if …
Can you check this path please? Does it exist?
Okay, here is the issue. The …
The use of local installations is not recommended in general. Guidance on the folder structure is provided in this paragraph.
This structure corresponds to the content of the redistributions that can be downloaded from the NVIDIA source. I suggest adding a symlink …
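The suggested symlink itself is omitted above. Given that the runfile installer puts libraries in `lib64` (see the next comment), a plausible reading is something like the sketch below, with the real CUDA prefix substituted in; this is my assumption, not a confirmed fix from the thread.

```
# Assumption: expose the runfile installer's lib64/ directory under the lib/
# name that the hermetic rules expect. Replace /path/to/cuda with the real prefix.
ln -s /path/to/cuda/lib64 /path/to/cuda/lib
```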
Ah, my problem is that I'm downloading and installing CUDA using the official runfiles provided by https://developer.nvidia.com/cuda-downloads, which seems to default to lib64 on my system. It would be nice if XLA could also support the runfile layout instead of only the redistribution layout, but I can bring that up with them another day. I'll confirm whether or not adding this symlink solves my issue and get back to you.
And we're on to the next issue:
As expected,
Looks like the exact same issue: there is a …
Please note that …
Oh. Well, that's not where it lives. Am I supposed to symlink every single file in my CUDA installation to get XLA building, or should I just change XLA to support the default CUDA installation scheme?
Hi @adamjstewart. If there are …
I'm using the standard flow generated by the official CUDA runfile. I tried adding symlinks in many places but still can't manage to get things to build. I may open an issue with XLA if I have time.
FYI, I am still having the issue with building …
@ybaturina, what do you think about renaming this issue so that it accurately reflects build issues with local CUDA across multiple JAX versions?
I don't think building JAX with the GCC 14 standard library works as of now. You may pass …
Regarding using a local CUDA installation for builds: the only reason it still exists is to support the workflow where you need to test two unreleased versions of the components together (unreleased JAX and unreleased CUDA), i.e. it is mainly for NVIDIA developers who work on CUDA directly. Unless you are actually developing CUDA itself, there should be no reason to ever depend on a local installation; it will just make life harder for you. Please do not use it unless you have a very, very specific reason to.
The issue is that JAX is distributed mainly as Python wheels, but this is at odds with Linux distributions like Arch or Nix, which strive to vendor all dependencies for better reproducibility and other reasons. From my perspective, it is indeed strange to have multiple CUDA runtimes on one system.
My "very very specific reason" is that I'm building for a supercomputer where CUDA has already been installed by the system administrators and I don't want to install an incompatible version. I'm also packaging JAX in a package manager (Spack) so that users can install the entire software stack they need without installing multiple incompatible versions of CUDA. Packages that automagically install their own dependencies are a fundamental challenge for any secure system, especially on an air-gapped network where the only source available is what is manually copied to the server. |
tl;dr: Nothing is being installed on your system during the build. With hermetic CUDA, nothing is installed on your system, and that is the main point here. It is pulled into the isolated Bazel cache during your build, together with many other unrelated dependencies (which you have always been pulling during the build, with or without CUDA), keeping your machine's environment clean and intact. All build dependencies, including CUDA, are pulled and checked against their SHA256 sums from trusted sources (CUDA in particular is pulled from the official NVIDIA source). JAX's build is complex, and the CUDA dependencies are on the more complex side of it. Having a specific version of CUDA on your system and trying to build against it will not make your build more robust (it will have exactly the opposite effect), as there are many other "players" in the game: compatibility of JAX itself with a specific CUDA version, wiring of CUDA headers (compile time) and libraries (link time) into the rest of the build, packaging of the final wheel, etc., plus all the custom Bazel logic that makes this possible and makes assumptions about what exactly is in your CUDA deps.
There are many other dependencies besides CUDA that get downloaded during the build, so in the air-gapped case it would either still not work, with or without CUDA, or, if you already provide many dozens of other dependencies from a custom HTTPS source, you can just add the dozen CUDA deps to the already long list you are providing. In other words, non-hermetic CUDA dependencies have always been a non-idiomatic, nightmare-to-maintain, insecure (it actually used to crawl your system to figure out what is where on your machine), extremely fragile, enormous build hack which we finally fixed. Please let it go. Controlling your own installation does not let you control the build, as you still rely on all the custom wiring around it, which makes assumptions about what it wires together. Nothing is being installed on your machine during the build.
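For comparison, a hedged sketch of the hermetic route with pinned versions matching the ones reported earlier in this thread; the `HERMETIC_*` repo environment variable names and accepted version strings should be double-checked against the current JAX/XLA build docs.

```
# Sketch: let Bazel download CUDA/cuDNN into its own cache instead of pointing
# at a local installation. Version numbers are illustrative, taken from the
# pacman output earlier in this thread.
python3 build/build.py --enable_cuda --cuda_compute_capabilities=8.0 \
  --bazel_options=--repo_env=HERMETIC_CUDA_VERSION=12.6.2 \
  --bazel_options=--repo_env=HERMETIC_CUDNN_VERSION=9.2.1
```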
Ubuntu 24.04 + CUDA 12.6 depends on GCC 13.
We can resolve the … For example, in XLA, you can add the …
Key changes in … now: …
For more details, see the related discussion: …
Description
When building jaxlib with an externally installed copy of CUDA (something required by all package managers and HPC systems), I see the following error:
It's possible I'm passing the wrong flags somewhere. I'm using:
python3 build/build.py --enable_cuda --cuda_compute_capabilities=8.0 \
    --bazel_options=--repo_env=LOCAL_CUDA_PATH=... \
    --bazel_options=--repo_env=LOCAL_CUDNN_PATH=... \
    --bazel_options=--repo_env=LOCAL_NCCL_PATH=...

(of course, with ... replaced by the actual paths)
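For example, with the concrete Arch Linux paths used elsewhere in this thread substituted in:

```
# Same invocation, with the /opt/cuda and /usr paths that appear in the
# comments above standing in for the elided local installation paths.
python3 build/build.py --enable_cuda --cuda_compute_capabilities=8.0 \
  --bazel_options=--repo_env=LOCAL_CUDA_PATH=/opt/cuda \
  --bazel_options=--repo_env=LOCAL_CUDNN_PATH=/usr \
  --bazel_options=--repo_env=LOCAL_NCCL_PATH=/usr
```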
System info (python version, jaxlib version, accelerator, etc.)
Build log