Description
This issue tracks progress on upgrading to CUDA 11.7 and decommissioning legacy CUDA versions.
CUDA Support Matrix as of PyTorch 1.12
CUDA | cuDNN | Additional details |
---|---|---|
10.2 | 7.6.5.32 | Legacy CUDA Release, to be decommissioned issue |
11.3 | 8.3.2.44 | Stable CUDA Release |
11.6 | 8.3.2.44 | Latest CUDA Release |
Pre CUDA 11.7 Upgrade
This work is required to promote CUDA 11.6 to the Stable version, and we want to address it before the CUDA 11.7 upgrade.
- Follow Up on the usage for cudatoolkit across pytorch projects pytorch#69691 Conda-forge dependency for 11.6 for cudatoolkit. In short, since CUDA 11.5, cudatoolkit is only available on the conda-forge channel. We should migrate from cudatoolkit to cuda and drop the use of conda-forge in pytorch, torchvision and torchaudio. This work should be scheduled and addressed as soon as we cut release 1.12 for pytorch and all domain libraries.
Decommission CUDA 10.2
Ideally this is addressed before the CUDA 11.7 upgrade, but it can also be done in parallel with it.
- Decomission CUDA 10.2 support #1026 Decommission CUDA 10.2 Support. We have an open issue to track this: issue and related discussion. With CUDA 11+, users cannot download it from pip, and pip is a very popular package manager.
Upgrade CUDA 11.7
As per https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD
- Installing to conda-builder and libtorch containers
- Push pytorch/conda-builder
- Push the libtorch image
- Add setup to manywheels
- Push pytorch/manylinux-builder
- Update MAGMA
- Push magma-cuda117 to conda
- Add magma for windows into our S3
- Add Windows builder for 11.7
- Check if driver needs to be updated
- Add fixes that had to come up
- Include CUDA 11.7 into our nightly matrix
- Update conda `build_pytorch.sh` script and add conda binaries
- Windows
- Linux
- MacOS
- Add fixes that had to come up
- Update conda
- Create 11.7 CI
- Windows
- Linux + add MAGMA to CI conda
- Add 11.7 to torchvision CI
- Add 11.7 to torchaudio CI
Past Issues to be Resolved by upgrade (needs to be retested)
- Pytorch linalg test failure with cuda 11.6 pytorch#75391
- Pytorch test failure with CUDA 11.6 pytorch#75375
- test_linalg_solve_triangular_large fails "CUDA error: too many resources requested for launch" on win cuda pytorch#70111
- Compilation of <torch/extension.h> error on Windows CUDA 11.5 pytorch#69460
- `TestProfilerCUDA.test_mem_leak` failing for CUDA 11.5 on Linux pytorch#69023
- `TestLinAlgCUDA.test_inverse_errors_large_cuda_float64` failing for CUDA 11.3 on Windows pytorch#57482
Post CUDA 11.7 Upgrade
- Evaluate CUDA 11.6 readiness #1106
- Decommission CUDA 11.3 #1123
- Move CUDA 11.6 as Stable CUDA
Target End State
CUDA 11.6 - Stable, CUDA 11.7 - Latest Experimental
CUDA 10.2 and CUDA 11.3 Decommissioned
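As a rough illustration of the transition from the support matrix at the top of this issue to the target end state, the plan can be written down as data. This is only a sketch: the dictionary layout, the status labels, and the helper names `apply_target_end_state` and `supported_versions` are hypothetical and do not correspond to anything in pytorch/builder. The cuDNN version for CUDA 11.7 is not stated in this issue, so it is left unset.

```python
# Hypothetical encoding of the support matrix listed in this issue (PyTorch 1.12).
SUPPORT_MATRIX = {
    "10.2": {"cudnn": "7.6.5.32", "status": "legacy"},
    "11.3": {"cudnn": "8.3.2.44", "status": "stable"},
    "11.6": {"cudnn": "8.3.2.44", "status": "latest"},
}


def apply_target_end_state(matrix):
    """Return a new matrix reflecting the target end state; the input is not mutated."""
    result = {version: dict(info) for version, info in matrix.items()}
    # CUDA 10.2 and CUDA 11.3 are decommissioned.
    for version in ("10.2", "11.3"):
        result[version]["status"] = "decommissioned"
    # CUDA 11.6 moves to Stable.
    result["11.6"]["status"] = "stable"
    # CUDA 11.7 becomes Latest Experimental; its cuDNN version is not given here.
    result["11.7"] = {"cudnn": None, "status": "latest-experimental"}
    return result


def supported_versions(matrix):
    """List CUDA versions that are not decommissioned."""
    return sorted(v for v, info in matrix.items() if info["status"] != "decommissioned")


if __name__ == "__main__":
    end_state = apply_target_end_state(SUPPORT_MATRIX)
    for version in sorted(end_state):
        print(version, end_state[version]["status"])
```

Keeping the current matrix and the end state as separate values makes it easy to diff the two while the individual checklist items above are still in progress.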
BE tasks for Meta Team
- Eliminate runbook manual step 6 by fixing this issue: Automate packer builds for aws/ami/windows test-infra#92
cc @ptrblck @malfet @seemethere @ezyang @pytorch/pytorch-dev-infra @ngimel