
Fixes MX formats build for blackwell #2214


Open · wants to merge 3 commits into main

Conversation

syed-ahmed (Contributor) commented May 15, 2025

When TORCH_CUDA_ARCH_LIST doesn't include 9.0a, setup.py first compiles the files in sources and then compiles the files in cutlass_90a_sources, and the second link step overwrites the _C.abi3.so produced by the first:

    [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/fp6_llm/fp6_linear.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 905; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 919; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1057; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1071; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    [5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
    building 'torchao._C' extension
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x
    [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/activation24/sparsify24.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
    copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao
    copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao

As a result of the overwrite, we get a NotImplementedError for the CUDA backend in the MXFP4_CUTLASS tests. In this PR, we propose building a separate _C_cutlass_90a extension so that _C.abi3.so is not overwritten. This also preserves the old behavior that only the files that need 90a are compiled with 90a.

The same is done for files that need to be compiled with 100a. Currently, if mx_fp_cutlass_kernels is not compiled with 100a, we see a "Cutlass cannot run" error on an SM 100 machine.
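
For illustration, a minimal sketch of the split in setup.py (the source-list contents, exact flags, and the _C_cutlass_100a module name are assumptions mirroring _C_cutlass_90a; the repository's actual setup.py may differ):

    # Sketch only -- not the PR's literal setup.py. It shows the idea of building
    # one extension module per arch-specific source group, so that linking the
    # CUTLASS 90a/100a kernels no longer overwrites torchao/_C.abi3.so.
    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    nvcc_base_flags = ["-O3", "-std=c++17", "-DTORCHAO_USE_CUTLASS"]

    # Abridged source lists taken from the build log above; the 100a list name
    # is an assumption that mirrors cutlass_90a_sources.
    sources = [
        "torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu",
        # ... other baseline kernels ...
    ]
    cutlass_90a_sources = [
        "torchao/csrc/cuda/activation24/sparsify24.cu",
        # ... other kernels that need sm_90a ...
    ]
    cutlass_100a_sources = [
        "torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu",
    ]

    ext_modules = [
        # Baseline kernels keep the flags derived from TORCH_CUDA_ARCH_LIST.
        CUDAExtension("torchao._C", sources,
                      extra_compile_args={"nvcc": nvcc_base_flags}),
        # Kernels that require the sm_90a feature set get their own .so.
        CUDAExtension("torchao._C_cutlass_90a", cutlass_90a_sources,
                      extra_compile_args={"nvcc": nvcc_base_flags +
                          ["-gencode=arch=compute_90a,code=sm_90a"]}),
        # Same idea for kernels that need sm_100a (e.g. the MX kernels).
        CUDAExtension("torchao._C_cutlass_100a", cutlass_100a_sources,
                      extra_compile_args={"nvcc": nvcc_base_flags +
                          ["-gencode=arch=compute_100a,code=sm_100a"]}),
    ]

    setup(name="torchao", ext_modules=ext_modules,
          cmdclass={"build_ext": BuildExtension})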

CC: @drisspg @vkuzo @msaroufim


pytorch-bot bot commented May 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2214

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f423915 with merge base 554cb60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on May 15, 2025
drisspg added the topic: bug fix label on May 15, 2025
drisspg (Contributor) commented May 15, 2025

This means we end up shipping 3 more .so files, right? I wonder if there isn't a better way to do this, by including the files in the extension module but allowing per-source flags.

syed-ahmed (Contributor, Author)

> This means we end up shipping 3 more .so files, right? I wonder if there isn't a better way to do this, by including the files in the extension module but allowing per-source flags.

I agree with that, but it seems there currently isn't a way to specify per-source flags using torch's cpp_extension. There is some machinery in CMake: https://github.com/pytorch/pytorch/blob/2362bd4a4c9bb909eb8265d3ca9854b5ec07867e/cmake/Codegen.cmake#L93-L126 and it would be nice if we could do something similar in CUDAExtension.
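
As a point of comparison, an illustrative sketch (the helper name and mapping are invented for this sketch; cpp_extension offers nothing like this today) of how a per-source nvcc-flag mapping can be approximated with the existing API. The mapping still has to be flattened into one CUDAExtension per flag group, i.e. one .so per arch, which is exactly the trade-off discussed here:

    # Illustrative workaround only: cpp_extension has no per-source flag support,
    # so this helper groups sources by their extra nvcc flags and emits one
    # CUDAExtension per group. All names below are invented for this sketch.
    from collections import defaultdict
    from torch.utils.cpp_extension import CUDAExtension

    def extensions_from_per_source_flags(base_name, per_source_nvcc_flags, base_flags):
        """Group sources that share extra nvcc flags into separate extensions."""
        groups = defaultdict(list)
        for src, extra in per_source_nvcc_flags.items():
            groups[tuple(extra)].append(src)
        exts = []
        for i, (extra, srcs) in enumerate(groups.items()):
            name = base_name if not extra else f"{base_name}_arch{i}"
            exts.append(CUDAExtension(
                name, srcs,
                extra_compile_args={"nvcc": base_flags + list(extra)}))
        return exts

    ext_modules = extensions_from_per_source_flags(
        "torchao._C",
        {
            # Per-source mapping: baseline kernels get no extra flags,
            # arch-specific kernels name their required gencode.
            "torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu": [],
            "torchao/csrc/cuda/activation24/sparsify24.cu":
                ["-gencode=arch=compute_90a,code=sm_90a"],
            "torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu":
                ["-gencode=arch=compute_100a,code=sm_100a"],
        },
        base_flags=["-O3", "-std=c++17", "-DTORCHAO_USE_CUTLASS"],
    )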

drisspg (Contributor) commented May 15, 2025

@syed-ahmed I got frustrated with cpp extensions and started working on #1659, but never got the bandwidth to drive it home.

drisspg (Contributor) left a review comment


To unblock, let's merge it, but can you open an issue stating that we currently need an .so per CUDA arch, and that we should either fix this in cpp_extensions or migrate to a pure CMake build to resolve it?

syed-ahmed (Contributor, Author)

Sounds good! Thanks.
