Support ROCM builds from source distribution, and improve error handling #1446


Merged: 5 commits into Dao-AILab:main from mgorny:git-improve, Jan 18, 2025

Conversation

@mgorny (Contributor) commented on Jan 15, 2025

Currently, the source distribution contains only the sources needed for CUDA builds (it's missing csrc/composable_kernel). Fix that and, while at it, clean up some of the logic in setup.py. More specifically:

  • always update both submodules when inside the git repo, so that they are both included in sdist
  • add missing .py files to sdist
  • run git submodule ... commands only when inside the git repo; otherwise (e.g. when installing from sdist), just verify that the submodules were checked out, and issue a more explanatory error if they weren't
  • use subprocess.run() throughout to run commands, and check their return status

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.
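
A minimal sketch of how the extra files could be pulled into the sdist, assuming sdist contents are driven by a MANIFEST.in (the patterns below are illustrative, not the project's exact manifest):

```
# Illustrative MANIFEST.in sketch -- patterns are assumptions.
# Ship the csrc Python helpers (e.g. generate.py) and the ROCm
# composable_kernel sources so builds from the sdist have what they need.
recursive-include csrc *.py
recursive-include csrc/composable_kernel *
```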

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.
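
A minimal sketch of the submodule handling described above, assuming the submodules live under `csrc/` (the paths, helper names, and error text are illustrative, not the exact code merged here):

```python
# Illustrative sketch only -- paths and helper names are assumptions,
# not the exact code from this PR.
import subprocess
from pathlib import Path

SUBMODULES = ["csrc/cutlass", "csrc/composable_kernel"]

def run(*cmd):
    # check=True makes setup.py fail immediately when a command fails,
    # instead of only printing to stderr and carrying on.
    subprocess.run(cmd, check=True)

def prepare_submodules():
    if Path(".git").exists():
        # Git checkout: always fetch both submodules so that both the
        # CUDA and the ROCm sources end up in the sdist.
        for path in SUBMODULES:
            run("git", "submodule", "update", "--init", path)
    else:
        # Building from an sdist: git cannot be used here, so just verify
        # that the submodule sources were shipped and fail with a clear
        # message if they were not.
        missing = [
            p for p in SUBMODULES
            if not (Path(p).is_dir() and any(Path(p).iterdir()))
        ]
        if missing:
            raise RuntimeError(
                f"Missing submodule sources: {missing}. The source "
                "distribution appears to be incomplete."
            )
```

The key points are the `.git` check that decides between the two paths, and `check=True`, which turns a failed command into an immediate, visible build failure.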
@tridao merged commit 6b1d059 into Dao-AILab:main on Jan 18, 2025
@mgorny (Contributor, Author) commented on Jan 18, 2025

Thank you!

@mgorny deleted the git-improve branch on January 18, 2025, 07:19
LucasWilkinson pushed a commit to vllm-project/flash-attention that referenced this pull request Feb 11, 2025
Support ROCM builds from source distribution, and improve error handling (Dao-AILab#1446)

* Always update both submodules to include them in sdist

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

* Include `*.py` files from composable_kernel in sdist

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.

* Replace the `os.system()` calls in `setup.py` with `subprocess.run()`

* Add error checking to `subprocess.run()` calls in `setup.py`

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

* Call git in `setup.py` only when working in a git repository

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.
LucasWilkinson added a commit to vllm-project/flash-attention that referenced this pull request Feb 11, 2025
* Support ROCM builds from source distribution, and improve error handling (Dao-AILab#1446)

* Always update both submodules to include them in sdist

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

* Include `*.py` files from composable_kernel in sdist

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.

* Replace the `os.system()` calls in `setup.py` with `subprocess.run()`

* Add error checking to `subprocess.run()` calls in `setup.py`

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

* Call git in `setup.py` only when working in a git repository

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.

* [Build] Update version of setuptools used to generate core package (Dao-AILab#1460)

* Don't compile for CUDA 11, compile for official pytorch 2.6.0

* Bump to v2.7.4

* Drop Pytorch 2.1

* [FA3] Compile with nvcc 12.8 instead of 12.3

* Fix comment in assert

* [CE] Assert logit_scale > 0

* Implement HeadDim_V != HeadDim_QK, support hdimQK=192, hdimV=128

* Fix shape_O in epilogue params when kHeadDimV != kHeadDim

* Remove old combine.h

* Fix loading paged V when kHeadDimV != kHeadDim

* Fix shape_V for storing new KV when kHeadDimV != kHeadDim

* Implement the case of LargeHeadDimV

* Rename Mma0->MmaQK, Mma1->MmaPV, use Cluster only if hdimV >= 192

* Pass _1 or _0 to cute::aligned_struct

* Fix compilation for FP8 when kHeadDimV != kHeadDim

* Support Qv

* Test varlen_q=True by default for kvcache

* Fix num_splits heuristic being called before get_pack_gqa

* Fix num_splits heuristic again when PackGQA

* Tile fwd_combine kernel along headdim, don't need kBlockM > 128

* Use bf16 instead of fp16 in benchmark_gemm.py

* Update Cutlass to 3.7

* Use nvcc 12.6 but ptxas 12.8

* cicc uses the same version as ptxas

* Split hdimdiff into a separate translation unit

* Update benchmark script

* Update Cutlass to 3.8

* Adjust tile size for hdim 64

* Adjust ninja build file

* build head diff + fix build errors

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Michał Górny <mgorny@gentoo.org>
Co-authored-by: Aman Karmani <aman@tmm1.net>
Co-authored-by: Tri Dao <tridpq@gmail.com>
LucasWilkinson added a commit to vllm-project/flash-attention that referenced this pull request Mar 20, 2025
* Support ROCM builds from source distribution, and improve error handling (Dao-AILab#1446)

* Always update both submodules to include them in sdist

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

* Include `*.py` files from composable_kernel in sdist

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.

* Replace the `os.system()` calls in `setup.py` with `subprocess.run()`

* Add error checking to `subprocess.run()` calls in `setup.py`

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

* Call git in `setup.py` only when working in a git repository

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.

* [Build] Update version of setuptools used to generate core package (Dao-AILab#1460)

* Don't compile for CUDA 11, compile for official pytorch 2.6.0

* Bump to v2.7.4

* Drop Pytorch 2.1

* [FA3] Compile with nvcc 12.8 instead of 12.3

* Fix comment in assert

* [CE] Assert logit_scale > 0

* Implement HeadDim_V != HeadDim_QK, support hdimQK=192, hdimV=128

* Fix shape_O in epilogue params when kHeadDimV != kHeadDim

* Remove old combine.h

* Fix loading paged V when kHeadDimV != kHeadDim

* Fix shape_V for storing new KV when kHeadDimV != kHeadDim

* Implement the case of LargeHeadDimV

* Rename Mma0->MmaQK, Mma1->MmaPV, use Cluster only if hdimV >= 192

* Pass _1 or _0 to cute::aligned_struct

* Fix compilation for FP8 when kHeadDimV != kHeadDim

* Support Qv

* Test varlen_q=True by default for kvcache

* Fix num_splits heuristic being called before get_pack_gqa

* Fix num_splits heuristic again when PackGQA

* Tile fwd_combine kernel along headdim, don't need kBlockM > 128

* Use bf16 instead of fp16 in benchmark_gemm.py

* Update Cutlass to 3.7

* Use nvcc 12.6 but ptxas 12.8

* cicc uses the same version as ptxas

* Split hdimdiff into a separate translation unit

* Update benchmark script

* Update Cutlass to 3.8

* Adjust tile size for hdim 64

* Adjust ninja build file

* Rename collective_mainloop -> mainloop, move tile_scheduler variable

* Move functions getting number of m/n blocks to a separate file

* Update cutlass 3.8 to fix error w cudaGetDriverEntryPointByVersion

* Fix FP8 test

* make seqused optional on top level interface (Dao-AILab#1497)

* Temporarily change package name of FA3 to allow FA2 & FA3 install

* Update benchmark_split_kv.py to work w new API

* Add tp_degree to benchmark_split_kv

* Fix divide by 0 in causal tile_scheduler for large seqlen

* Use split for super long sequences that don't fit into L2

* Make rotary test optional in FA3

* Enable MLA flag in FA3 (rope=64, latent=512) (Dao-AILab#1504)

* Enable MLA flag in FA3 (rope=64, latent=512)

* updated HasQv in flash_fwd_launch_template.h

* Add simple script to benchmark MLA decode

* Add dynamic splits

* Update to Cutlass 3.8.0 tag

* Adjust seqlen_q in MLA decode benchmark script

* Fix loop in prepare_scheduler.cu (h/t Jay Shah)

Only affects the case where batch size > 256

* fix: add "typename" prior to dependent type name (Dao-AILab#1517)

This project uses c++17 which still has this requirement.

Signed-off-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>

* Add FLOPS to MLA decode benchmark

* Change margin in prepare_scheduler.cu from 20% to 10%

* Fix cuda 12.1 build (Dao-AILab#1511)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Don't use IntraWGOverlap for hdim 64,512

* Remove sink token

It wasn't working anyway

* fix: prompt index to type longlong to avoid numerical overflow (Dao-AILab#1500)

* Add option for WG1 to use RS MMA but WG2 using SS MMA

* Add kwargs to _write_ninja_file for compatibility with new torch

* Move writing P to smem as separate function

* Fix causal scheduler not considering hdim_v != hdim

* Always split fwd_combine_kernel on batch

* For each batch, if num_splits=1, write to O instead of O_partial

* Enable TMA when page size is a multiple of kBlockN

* Update ptxas to 12.8.93 (i.e. 12.8.1)

* Use tile size 192 x 128 for hdim 64 causal

* Update benchmark_mla_decode.py

* Benchmark MHA, GQA, MQA, MLA in the same script

* Benchmark FlashMLA if it's available

* Run all 4 attn variants in benchmark

* Move scheduler.get_next_work to before the epilogue

* Enable Cluster for hdim128 back

* Move tOrO init in mainloop

* Adjust heuristic for get_pagedkv_tma

* Enable PDL

* Simplify prepare_varlen_num_blocks_kernel, restrict to batch <= 992

* Fix: num_splits_dynamic_ptr needs to be set before get_num_splits

* Loop on num_splits instead of parameterizing it in kvcache test

* Add option to precompute scheduler metadata

* Update MLA decode benchmark to use get_scheduler_metadata

* Fix FP8 test to quantize KV cache for reference impl as well

* Dynamic autotune configs for devices with warp size != 32 (Dao-AILab#1534)

Generate a list of autotune configs based on device warp size to avoid triton error if maximum threads per block is exceeded.

* update binding

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Michał Górny <mgorny@gentoo.org>
Co-authored-by: Aman Karmani <aman@tmm1.net>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>
Co-authored-by: xin-w8023 <43900898+xin-w8023@users.noreply.github.com>
Co-authored-by: schung-amd <Steven.Chung@amd.com>
tlrmchlsmth pushed a commit to vllm-project/flash-attention that referenced this pull request Apr 10, 2025
* Support ROCM builds from source distribution, and improve error handling (Dao-AILab#1446)

* Always update both submodules to include them in sdist

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

* Include `*.py` files from composable_kernel in sdist

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.

* Replace the `os.system()` calls in `setup.py` with `subprocess.run()`

* Add error checking to `subprocess.run()` calls in `setup.py`

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

* Call git in `setup.py` only when working in a git repository

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.

* [Build] Update version of setuptools used to generate core package (Dao-AILab#1460)

* Don't compile for CUDA 11, compile for official pytorch 2.6.0

* Bump to v2.7.4

* Drop Pytorch 2.1

* [FA3] Compile with nvcc 12.8 instead of 12.3

* Fix comment in assert

* [CE] Assert logit_scale > 0

* Implement HeadDim_V != HeadDim_QK, support hdimQK=192, hdimV=128

* Fix shape_O in epilogue params when kHeadDimV != kHeadDim

* Remove old combine.h

* Fix loading paged V when kHeadDimV != kHeadDim

* Fix shape_V for storing new KV when kHeadDimV != kHeadDim

* Implement the case of LargeHeadDimV

* Rename Mma0->MmaQK, Mma1->MmaPV, use Cluster only if hdimV >= 192

* Pass _1 or _0 to cute::aligned_struct

* Fix compilation for FP8 when kHeadDimV != kHeadDim

* Support Qv

* Test varlen_q=True by default for kvcache

* Fix num_splits heuristic being called before get_pack_gqa

* Fix num_splits heuristic again when PackGQA

* Tile fwd_combine kernel along headdim, don't need kBlockM > 128

* Use bf16 instead of fp16 in benchmark_gemm.py

* Update Cutlass to 3.7

* Use nvcc 12.6 but ptxas 12.8

* cicc uses the same version as ptxas

* Split hdimdiff into a separate translation unit

* Update benchmark script

* Update Cutlass to 3.8

* Adjust tile size for hdim 64

* Adjust ninja build file

* Rename collective_mainloop -> mainloop, move tile_scheduler variable

* Move functions getting number of m/n blocks to a separate file

* Update cutlass 3.8 to fix error w cudaGetDriverEntryPointByVersion

* Fix FP8 test

* make seqused optional on top level interface (Dao-AILab#1497)

* Temporarily change package name of FA3 to allow FA2 & FA3 install

* Update benchmark_split_kv.py to work w new API

* Add tp_degree to benchmark_split_kv

* Fix divide by 0 in causal tile_scheduler for large seqlen

* Use split for super long sequences that don't fit into L2

* Make rotary test optional in FA3

* Enable MLA flag in FA3 (rope=64, latent=512) (Dao-AILab#1504)

* Enable MLA flag in FA3 (rope=64, latent=512)

* updated HasQv in flash_fwd_launch_template.h

* Add simple script to benchmark MLA decode

* Add dynamic splits

* Update to Cutlass 3.8.0 tag

* Adjust seqlen_q in MLA decode benchmark script

* Fix loop in prepare_scheduler.cu (h/t Jay Shah)

Only affects the case where batch size > 256

* fix: add "typename" prior to dependent type name (Dao-AILab#1517)

This project uses c++17 which still has this requirement.

Signed-off-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>

* Add FLOPS to MLA decode benchmark

* Change margin in prepare_scheduler.cu from 20% to 10%

* Fix cuda 12.1 build (Dao-AILab#1511)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Don't use IntraWGOverlap for hdim 64,512

* Remove sink token

It wasn't working anyway

* fix: prompt index to type longlong to avoid numerical overflow (Dao-AILab#1500)

* Add option for WG1 to use RS MMA but WG2 using SS MMA

* Add kwargs to _write_ninja_file for compatibility with new torch

* Move writing P to smem as separate function

* Fix causal scheduler not considering hdim_v != hdim

* Always split fwd_combine_kernel on batch

* For each batch, if num_splits=1, write to O instead of O_partial

* Enable TMA when page size is a multiple of kBlockN

* Update ptxas to 12.8.93 (i.e. 12.8.1)

* Use tile size 192 x 128 for hdim 64 causal

* Update benchmark_mla_decode.py

* Benchmark MHA, GQA, MQA, MLA in the same script

* Benchmark FlashMLA if it's available

* Run all 4 attn variants in benchmark

* Move scheduler.get_next_work to before the epilogue

* Enable Cluster for hdim128 back

* Move tOrO init in mainloop

* Adjust heuristic for get_pagedkv_tma

* Enable PDL

* Simplify prepare_varlen_num_blocks_kernel, restrict to batch <= 992

* Fix: num_splits_dynamic_ptr needs to be set before get_num_splits

* Loop on num_splits instead of parameterizing it in kvcache test

* Add option to precompute scheduler metadata

* Update MLA decode benchmark to use get_scheduler_metadata

* Fix FP8 test to quantize KV cache for reference impl as well

* Dynamic autotune configs for devices with warp size != 32 (Dao-AILab#1534)

Generate a list of autotune configs based on device warp size to avoid triton error if maximum threads per block is exceeded.

* Add option for rotary_seqlens

* Use StreamkBarrier0/1 barriers instead of TileCountSmemEmpty/Full

* Update Cutlass to 3.9

* Support hdim 64,256

* Update benchmark with GLA

* Adjust warp scheduler sync for HasQv case

* num_head -> args.num_head (Dao-AILab#1552)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* Fix zeroing out the scheduler semaphore when reusing metadata

* fix deprecation warning for newer torch versions (Dao-AILab#1565)

* Don't use FusedDense anymore to simplify code

* Fix FA3 qkvpacked interface

* Launch more thread blocks in layer_norm_bwd

* check valid tile before storing num_splits in split_idx (Dao-AILab#1578)

* Tune rotary kernel to use 2 warps if rotary_dim <= 64

* update api

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Michał Górny <mgorny@gentoo.org>
Co-authored-by: Aman Karmani <aman@tmm1.net>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Jiang, Zhiwei <zhiwei.jiang@intel.com>
Co-authored-by: xin-w8023 <43900898+xin-w8023@users.noreply.github.com>
Co-authored-by: schung-amd <Steven.Chung@amd.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
playerzer0x pushed a commit to Liqhtworks/flash-attention that referenced this pull request Jul 24, 2025
Support ROCM builds from source distribution, and improve error handling (Dao-AILab#1446)

* Always update both submodules to include them in sdist

Always update both submodules, irrespective of whether a CUDA
or a ROCM build is being done, to ensure that the necessary files
from both are present in sdist.  Otherwise, attempting a ROCM build
from sdist fails because of missing `composable_kernel` sources.

* Include `*.py` files from composable_kernel in sdist

Include the `*.py` files from `csrc` in sdist, to ensure that
the `generate.py` script is present.

* Replace the `os.system()` calls in `setup.py` with `subprocess.run()`

* Add error checking to `subprocess.run()` calls in `setup.py`

Add error checking to ensure that `setup.py` fails immediately if one
of the commands fails.  Otherwise, the failures result only in messages
to stderr that could be missed, and could lead to more confusing errors
later in the build process.

* Call git in `setup.py` only when working in a git repository

Call git commands in `setup.py` only when the `.git` directory is
present, indicating that we are working in a git checkout.  Otherwise,
just assert that the needed files are there.  With this, building
from a source distribution no longer attempts to call git
in an incorrect directory.