So you wanna upgrade PyTorch to support a new CUDA? Follow these steps in order! They are adapted from the CUDA 11.3 and CUDA 11.2 processes.
Make an issue to track the progress, for example #56721: Support 11.3. This is especially important as many PyTorch external users are interested in CUDA upgrades.
There are three types of Docker containers we maintain in order to build Linux binaries: `conda`, `libtorch`, and `manywheel`. They all require installing CUDA and then updating code references in respective build scripts/Dockerfiles. (Code reference: PR 719, PR 720, and PR 724)
- Modifying `install_cuda.sh`:
  - Find the CUDA install link here
  - Get the cuDNN link from NVIDIA on the PyTorch Slack
  - Run the `install_113` chunk of code on your devbox to make sure it works.
  - Check this link to see if you need to add/remove any architectures to the nvprune list.
  - Go into your cuda-11.3 folder and make sure what you're pruning actually exists. Update versions as needed, especially the visual tools like `nsight-systems`.
- Add setup for our Docker images in respective `libtorch`, `conda`, and `manywheel` scripts/Dockerfiles:
  - For `conda` and `libtorch`, the code changes are usually copy-paste. For `manywheel`, you should manually verify the versions of the shared libraries with the CUDA you downloaded before.
  - To test that your code works, from the root builder repo, run something similar to `export CUDA_VERSION=11.3 && ./conda/build_docker.sh` for the `conda` images. You can extend this to the other images.
  - Push the images to Docker Hub. Find someone who has access to Docker and the credentials to Docker Hub and have them build, tag, and push the images. This step will be automated soon with the help of GitHub Actions in the `pytorch/builder` repo. Make sure to update the `cuda_version` to the version you're adding in respective YAMLs, such as `.github/workflows/build-manywheel-images.yml`.
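The local build test described above can be extended to all three image types. Here is a minimal sketch, assuming `libtorch/build_docker.sh` and `manywheel/build_docker.sh` sit next to `conda/build_docker.sh` and read the same `CUDA_VERSION` variable (only the `conda` command comes from the text above). `print_build_cmds` is a hypothetical helper that prints the commands instead of running them:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the image build command for each
# image type. Only the conda path is confirmed by this guide; the libtorch
# and manywheel paths are assumed to follow the same layout.
print_build_cmds() {
    cuda_version="$1"
    for image in conda libtorch manywheel; do
        printf 'CUDA_VERSION=%s ./%s/build_docker.sh\n' "$cuda_version" "$image"
    done
}

print_build_cmds 11.3
```

Once the printed commands look right, you could pipe them to `sh` from the root of the builder repo.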
Code reference for the Windows builder changes: PR 728 and followup fixes in PR 751 and PR 755
- To get the CUDA install link, just like with Linux, go here and upload that `.exe` file to our S3 bucket.
- To get the cuDNN install link, you could ask NVIDIA, but you could also just sign up for an NVIDIA account and access the needed `.zip` file at this link. First click on `cuDNN Library for Windows (x86)` and then upload that zip file to our S3 bucket.
- NOTE: When you upload files to S3, make sure to make these objects publicly readable so that our CI can access them!
- Most times, you have to upgrade the driver install for newer versions, which would look like updating the `windows/internal/driver_update.bat` file
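For the S3 uploads, the public-read requirement can be met at upload time. Here is a minimal sketch, assuming the AWS CLI is used and the bucket permits object ACLs. The bucket name, file name, and key below are placeholders, and `print_s3_upload_cmd` is a hypothetical helper that only prints the command:

```shell
#!/bin/sh
# Hypothetical helper: print the AWS CLI command that uploads a file and marks
# the object publicly readable. Bucket and key are placeholders, not the real
# PyTorch bucket layout.
print_s3_upload_cmd() {
    local_file="$1"
    bucket="$2"
    key="$3"
    printf 'aws s3 cp %s s3://%s/%s --acl public-read\n' "$local_file" "$bucket" "$key"
}

print_s3_upload_cmd cuda_installer.exe example-bucket cuda_installer.exe
```

Printing the command first makes it easy to double-check the destination before anything is uploaded.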
Compile MAGMA with the new CUDA version.
- Our Linux CUDA jobs use conda, so we need to build magma-cuda113 and push it to anaconda. (Code reference: PR 721)
  - Currently, this is mainly copy-paste in `magma/Makefile` if there are no major code API changes/deprecations to the CUDA version. Previously, we've needed to add patches to MAGMA, so this may be something to check with NVIDIA about.
  - To push the package, please follow the instructions here.
  - NOTE: This step relies on the conda-builder image, so make sure you have pushed the new conda-builder prior.
- Our Windows jobs download MAGMA binaries from our S3 `ossci-windows` bucket, which means that those binaries need to exist.
  - Lucky for you! There is a GitHub Actions workflow (as of PR 762) that automates this upload when you update `windows/internal/build_magma.bat` to deal with your new CUDA version. Also remember to update the `cuda_version` in `.github/workflows/build-magma-windows.yml` to be the new version. (Code reference: PR 751)
  - NOTE: this step should occur AFTER you update the Windows builder code so that the new version is installed correctly.
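The conda package name follows the CUDA version with the dot dropped (11.3 gives magma-cuda113). Here is a small sketch of that mapping. `magma_pkg_name` is a hypothetical helper, not something defined in `magma/Makefile`:

```shell
#!/bin/sh
# Hypothetical helper: derive the MAGMA conda package name from a CUDA version
# by dropping the dot, e.g. 11.3 -> magma-cuda113.
magma_pkg_name() {
    printf 'magma-cuda%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

magma_pkg_name 11.3   # prints magma-cuda113
```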
Testing the new version in CI is crucial for finding regressions and should be done ASAP along with the next step (I am simply putting this one first as it is usually easier).
- If the new CUDA version requires a new driver, the CI and binaries would also need the new driver. Find the driver download here and update the link like so.
- For Linux, we need to update code to use the magma we built! This can be done in the same PR when you actually add Linux CI, but here's an independent example for 11.2: PR 50559
- The configuration files will be subject to change, but usually you just have to replace an older CUDA version with the new version you're adding. Code references: for 11.2, PR 51888 (Linux) and PR 51598 (Windows); for 11.3, where we just replaced the YAML verbatim and updated magma for conda on Linux, PR 57222 (Linux) and PR 57223 (Windows).
- It is likely that there will be tests that no longer pass with the new CUDA version or GPU driver. Disable them for the time being, notify people who can help, and make issues to track them (like so).
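Replacing an older CUDA version in the config can be sketched as a plain text substitution. This is a rough illustration only: `bump_cuda_version` is a hypothetical helper, the two sample lines are made up rather than taken from the real CI configuration, and the swap is blunt enough that the result should be reviewed by hand:

```shell
#!/bin/sh
# Hypothetical helper: rewrite CUDA version references read from stdin.
# Handles the dotted form (11.2 -> 11.3) and the compact form that follows
# "cuda" (cuda112 -> cuda113). This is a blunt text swap, so review the diff.
bump_cuda_version() {
    old="$1"
    new="$2"
    old_re="$(printf '%s' "$old" | sed 's/\./\\./g')"   # escape dots for sed
    old_compact="$(printf '%s' "$old" | tr -d '.')"
    new_compact="$(printf '%s' "$new" | tr -d '.')"
    sed -e "s/$old_re/$new/g" -e "s/cuda$old_compact/cuda$new_compact/g"
}

# Made-up sample input, not the real CI config.
printf 'cuda_version: 11.2\npackage: magma-cuda112\n' | bump_cuda_version 11.2 11.3
```

Because it reads stdin and writes stdout, the helper never edits files in place. Redirect the output to a new file and diff it against the original before committing.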
Adding the new version to nightlies allows PyTorch binaries compiled with the new CUDA version to be available to users through `conda` or `pip` or just raw `libtorch`.
- The difficulty in this task is NOT changing the config--you only need to modify this line--but the debugging process that ensues.
- Since this change should not touch other build jobs and it is very likely you would be running these jobs on the CI frequently, I'd advise reducing the config to only the build jobs for the new CI version and to use your own fork of `pytorch/builder`. Code reference: PR 57522.
- Don't be afraid to ask questions when you're stuck on any bug!
Congrats! PyTorch now has support for a new CUDA version and you made it happen!