[CI][AMD] Add AMD GPU CI and fix some related bugs (#694)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Update AMD FlashAttention example and TVM submodule
- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
- Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
- Updated the TVM submodule to the latest commit for improved functionality.
- Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
* Add CI workflow for automated format checking and testing
- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
- The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
- Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
* Update AMD CI workflow to copy the PyTorch, TorchVision, and Torchaudio packages into the virtual environment for improved dependency management.
* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
* Refactor AMD CI workflow to make the step that removes 'flash-attn' from requirements-test.txt before installation clearer
* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
* Add installation of ROCm in AMD CI workflow
- Included a step to execute the `install_rocm.sh` script for improved setup.
- Removed unnecessary blank line for better readability in the workflow script.
* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
* Update AMD CI workflow to run specific test file with verbose output instead of all tests.
* Add new tilelang built-in operations for AMD architecture
- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
- Each operation is registered with the appropriate number of inputs and an opaque call effect.
* Enhance autotuner configurations and GEMM operations in AMD example
- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
- Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
* Update autotuner configurations in AMD example for enhanced performance
- Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
- Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
* Enhance autotuner configurations and memory handling in AMD example
- Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
- Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
* Refine autotuner configurations and memory usage in AMD example
- Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
- Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
* Update autotuner configurations in AMD example for enhanced performance
- Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
- Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
* Enhance autotuner configurations and GEMM operations in AMD example
- Expanded thread counts in `get_configs` to include higher values for improved autotuning.
- Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
* Update AMD CI workflow and remove obsolete test script
- Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
- Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
* Remove TVM subproject from 3rdparty directory
* Refactor configuration generation and accumulation logic in AMD example
- Reformatted the `get_configs` function for improved readability by aligning parameters.
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
* Enhance AMD CI workflow with additional logging and setup steps
- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
- Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
* Add whitespace to the `BuildTileLangHIP` function for improved readability
* Refactor the `kTVMGridConstant` definition for clarity and remove an unnecessary comment
* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
* lint fix
* Update AMD CI workflow to use requirements-rocm.txt for dependency installation
* fix ci
* Remove dependency on format-check from AMD CI workflow
* fix ci
* fix ci
* fix ci
* Remove format-check job from AMD CI workflow
* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
* Add dependency on format-check job in AMD CI workflow
* Add format-check job to AMD CI workflow
* Update format-check job in AMD CI workflow to run on self-hosted environment
* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
* Update amd_ci.yml
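
Note: several items above adjust the autotuner search space returned by `get_configs`. A minimal sketch of how such a search space could be enumerated is shown below. It is an illustration only, not the code in `example_amd_flash_attn_fwd.py`: the key names `block_M` and `block_N` and every value are hypothetical placeholders, while `threads`, `num_split_q`, `num_stages`, `enable_rasterization`, and `k_pack` are taken from the parameter names mentioned in this message.

```python
import itertools


def get_configs():
    """Enumerate a hypothetical autotuning search space as a list of dicts.

    All values are placeholders chosen for illustration; they are not the
    values used by the real example script.
    """
    space = {
        "block_M": [64, 128],            # hypothetical name for the Q tile size
        "block_N": [32, 64],             # hypothetical name for the K/V tile size
        "threads": [128, 256],
        "num_split_q": [32, 64],
        "num_stages": [0, 1],
        "enable_rasterization": [True],
        "k_pack": [2],
    }
    keys = list(space)
    # The Cartesian product yields one candidate per combination of values.
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


if __name__ == "__main__":
    configs = get_configs()
    print(f"{len(configs)} candidate configurations")
    print(configs[0])
```

Because the candidate count is the product of the list lengths, expanding or reducing any one list (as the autotuner commits above do) scales tuning time multiplicatively.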
---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>