[mlir][AMDGPU] Improve DPP implementation of subgroup reduction #136804
Draft: Muzammiluddin-Syed-ECE wants to merge 28 commits into llvm:main from Muzammiluddin-Syed-ECE:muzasyed/vector
Conversation
You can test this locally with the following command:

git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- mlir/include/mlir/Dialect/GPU/Utils/ReductionUtils.h mlir/lib/Dialect/GPU/Utils/ReductionUtils.cpp mlir/include/mlir/Dialect/GPU/Transforms/Passes.h mlir/include/mlir/Dialect/GPU/Utils/GPUUtils.h mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp mlir/lib/Dialect/GPU/Utils/Utils.cpp mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp

View the diff from clang-format here.

diff --git a/mlir/include/mlir/Dialect/GPU/Utils/ReductionUtils.h b/mlir/include/mlir/Dialect/GPU/Utils/ReductionUtils.h
index f766dab8c..3c42b1f4e 100644
--- a/mlir/include/mlir/Dialect/GPU/Utils/ReductionUtils.h
+++ b/mlir/include/mlir/Dialect/GPU/Utils/ReductionUtils.h
@@ -9,9 +9,9 @@
#ifndef MLIR_DIALECT_GPU_TRANSFORMS_REDUCTIONUTILS_H_
#define MLIR_DIALECT_GPU_TRANSFORMS_REDUCTIONUTILS_H_
-#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
+#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
@@ -27,7 +27,7 @@ struct ClusterInfo {
};
FailureOr<ClusterInfo> getAndValidateClusterInfo(gpu::SubgroupReduceOp op,
- unsigned subgroupSize);
+ unsigned subgroupSize);
FailureOr<Value>
createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
diff --git a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
index 57af63cbe..7f5e38b79 100644
--- a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
+++ b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
@@ -161,7 +161,8 @@ struct ScalarizeSingleElementReduce final
// std::optional<uint32_t> clusterSize = op.getClusterSize();
// assert(!clusterSize ||
-// llvm::isPowerOf2_32(*clusterSize)); // Verifier should've caught this.
+// llvm::isPowerOf2_32(*clusterSize)); // Verifier should've caught
+// this.
// if (clusterSize && *clusterSize > subgroupSize)
// return op.emitOpError()
// << "cluster size " << *clusterSize
@@ -169,8 +170,8 @@ struct ScalarizeSingleElementReduce final
// unsigned effectiveClusterSize = clusterSize.value_or(subgroupSize);
// auto clusterStride = op.getClusterStride();
-// assert(llvm::isPowerOf2_32(clusterStride)); // Verifier should've caught this.
-// if (clusterStride >= subgroupSize)
+// assert(llvm::isPowerOf2_32(clusterStride)); // Verifier should've caught
+// this. if (clusterStride >= subgroupSize)
// return op.emitOpError()
// << "cluster stride " << clusterStride
// << " is not less than subgroup size " << subgroupSize;
@@ -369,7 +370,8 @@ private:
};
// FailureOr<Value>
-// createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
+// createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp
+// op,
// Value input, gpu::AllReduceOperation mode,
// const ClusterInfo &ci, amdgpu::Chipset chipset) {
// Location loc = op.getLoc();
@@ -382,18 +384,22 @@ private:
// // Perform reduction between all lanes N <-> N+1.
// dpp = rewriter.create<amdgpu::DPPOp>(
// loc, res.getType(), res, res, amdgpu::DPPPerm::quad_perm,
-// rewriter.getI32ArrayAttr({1, 0, 3, 2}), allRows, allBanks, boundCtrl);
+// rewriter.getI32ArrayAttr({1, 0, 3, 2}), allRows, allBanks,
+// boundCtrl);
// res = vector::makeArithReduction(rewriter, loc,
-// gpu::convertReductionKind(mode), res, dpp);
+// gpu::convertReductionKind(mode), res,
+// dpp);
// }
// if (ci.clusterSize >= 4) {
// // Perform reduction between all lanes N <-> N+2.
// dpp = rewriter.create<amdgpu::DPPOp>(
// loc, res.getType(), res, res, amdgpu::DPPPerm::quad_perm,
-// rewriter.getI32ArrayAttr({2, 3, 0, 1}), allRows, allBanks, boundCtrl);
+// rewriter.getI32ArrayAttr({2, 3, 0, 1}), allRows, allBanks,
+// boundCtrl);
// res = vector::makeArithReduction(rewriter, loc,
-// gpu::convertReductionKind(mode), res, dpp);
+// gpu::convertReductionKind(mode), res,
+// dpp);
// }
// if (ci.clusterSize >= 8) {
// // Perform reduction between all lanes N <-> 7-N,
@@ -402,16 +408,18 @@ private:
// loc, res.getType(), res, res, amdgpu::DPPPerm::row_half_mirror,
// rewriter.getUnitAttr(), allRows, allBanks, boundCtrl);
// res = vector::makeArithReduction(rewriter, loc,
-// gpu::convertReductionKind(mode), res, dpp);
+// gpu::convertReductionKind(mode), res,
+// dpp);
// }
// if (ci.clusterSize >= 16) {
// // Perform reduction between all lanes N <-> 15-N,
-// // e.g lane[0] <-> lane[15], lane[1] <-> lane[14]..., lane[7] <-> lane[8].
-// dpp = rewriter.create<amdgpu::DPPOp>(
+// // e.g lane[0] <-> lane[15], lane[1] <-> lane[14]..., lane[7] <->
+// lane[8]. dpp = rewriter.create<amdgpu::DPPOp>(
// loc, res.getType(), res, res, amdgpu::DPPPerm::row_mirror,
// rewriter.getUnitAttr(), allRows, allBanks, boundCtrl);
// res = vector::makeArithReduction(rewriter, loc,
-// gpu::convertReductionKind(mode), res, dpp);
+// gpu::convertReductionKind(mode), res,
+// dpp);
// }
// if (ci.clusterSize >= 32) {
// if (chipset.majorVersion <= 9) {
@@ -427,7 +435,8 @@ private:
// // Use a permute lane to cross rows (row 1 <-> row 0, row 3 <-> row 2).
// Value uint32Max = rewriter.create<arith::ConstantOp>(
// loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(-1));
-// dpp = rewriter.create<ROCDL::PermlaneX16Op>(loc, res.getType(), res, res,
+// dpp = rewriter.create<ROCDL::PermlaneX16Op>(loc, res.getType(), res,
+// res,
// uint32Max, uint32Max,
// /*fi=*/true,
// /*bound_ctrl=*/false);
@@ -437,7 +446,8 @@ private:
// Value lane0 = rewriter.create<arith::ConstantOp>(
// loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(0));
// res =
-// rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(), res, lane0);
+// rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(), res,
+// lane0);
// }
// } else {
// return rewriter.notifyMatchFailure(
@@ -462,15 +472,17 @@ private:
// loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(0));
// Value lane32 = rewriter.create<arith::ConstantOp>(
// loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(32));
-// dpp = rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(), res, lane32);
-// res = rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(), res, lane0);
+// dpp = rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(), res,
+// lane32); res = rewriter.create<ROCDL::ReadlaneOp>(loc, res.getType(),
+// res, lane0);
// } else {
// return rewriter.notifyMatchFailure(
// op, "Subgroup reduce lowering to DPP not currently supported for "
// "this device.");
// }
// res = vector::makeArithReduction(rewriter, loc,
-// gpu::convertReductionKind(mode), res, dpp);
+// gpu::convertReductionKind(mode), res,
+// dpp);
// }
// assert(res.getType() == input.getType());
// return res;
@@ -484,8 +496,9 @@ struct ScalarSubgroupReduceToDPP final
ScalarSubgroupReduceToDPP(MLIRContext *ctx, unsigned subgroupSize,
unsigned shuffleBitwidth, bool matchClustered,
amdgpu::Chipset chipset, PatternBenefit benefit)
- : OpRewritePattern(ctx, benefit), subgroupSize(subgroupSize), shuffleBitwidth(shuffleBitwidth),
- matchClustered(matchClustered), chipset(chipset) {}
+ : OpRewritePattern(ctx, benefit), subgroupSize(subgroupSize),
+ shuffleBitwidth(shuffleBitwidth), matchClustered(matchClustered),
+ chipset(chipset) {}
LogicalResult matchAndRewrite(gpu::SubgroupReduceOp op,
PatternRewriter &rewriter) const override {
@@ -540,8 +553,9 @@ struct ScalarSubgroupReduceToDPP final
return rewriter.create<arith::BitcastOp>(loc, valueTy, asInt);
};
- FailureOr<Value> dpp = createSubgroupDPPReduction(
- rewriter, op, op.getValue(), op.getOp(), *ci, chipset, packFn, unpackFn);
+ FailureOr<Value> dpp =
+ createSubgroupDPPReduction(rewriter, op, op.getValue(), op.getOp(), *ci,
+ chipset, packFn, unpackFn);
if (failed(dpp))
return failure();
diff --git a/mlir/lib/Dialect/GPU/Utils/ReductionUtils.cpp b/mlir/lib/Dialect/GPU/Utils/ReductionUtils.cpp
index 2f50a1ec8..a0ef7f329 100644
--- a/mlir/lib/Dialect/GPU/Utils/ReductionUtils.cpp
+++ b/mlir/lib/Dialect/GPU/Utils/ReductionUtils.cpp
@@ -10,11 +10,11 @@
//
//===----------------------------------------------------------------------===//
-#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
-#include "mlir/Dialect/GPU/Utils/GPUUtils.h"
#include "mlir/Dialect/GPU/Utils/ReductionUtils.h"
+#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/GPU/Utils/GPUUtils.h"
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/Value.h"
#include "mlir/Interfaces/FunctionInterfaces.h"
@@ -24,7 +24,7 @@
using namespace mlir;
FailureOr<ClusterInfo> mlir::getAndValidateClusterInfo(gpu::SubgroupReduceOp op,
- unsigned subgroupSize) {
+ unsigned subgroupSize) {
assert(llvm::isPowerOf2_32(subgroupSize));
std::optional<uint32_t> clusterSize = op.getClusterSize();
@@ -51,7 +51,7 @@ FailureOr<Value> mlir::createSubgroupDPPReduction(
gpu::AllReduceOperation mode, const ClusterInfo &ci,
amdgpu::Chipset chipset, function_ref<Value(Value)> packFn,
function_ref<Value(Value)> unpackFn) {
-
+
Location loc = op.getLoc();
Value dpp;
Value res = input;
diff --git a/mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp b/mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp
index 4ebcf897f..fd8b34288 100644
--- a/mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp
+++ b/mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp
@@ -93,9 +93,11 @@ struct TestGpuSubgroupReduceLoweringPass
auto maybeChipset = amdgpu::Chipset::parse(target);
if (succeeded(maybeChipset)) {
populateGpuLowerSubgroupReduceToDPPPatterns(
- patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32, *maybeChipset, PatternBenefit(2));
+ patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32,
+ *maybeChipset, PatternBenefit(2));
populateGpuLowerClusteredSubgroupReduceToDPPPatterns(
- patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32, *maybeChipset, PatternBenefit(2));
+ patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32,
+ *maybeChipset, PatternBenefit(2));
}
populateGpuLowerSubgroupReduceToShufflePatterns(
patterns, /*subgroupSize=*/32, /*shuffleBitwidth=*/32);
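
Beyond the whitespace fixes, the hunks above also hint at the functional shape of this change: the DPP reduction helpers appear to move into a shared ReductionUtils header, and createSubgroupDPPReduction now takes pack/unpack callbacks alongside the chipset. The sketch below pieces together how a lowering pattern might call the refactored helper, based only on the hunks shown here; the wrapper function name, the plain-bitcast pack, and the local variable names are illustrative assumptions rather than the actual implementation.

// Illustrative sketch only: the wrapper name, the plain-bitcast pack, and the
// local variable names are assumptions; the helper signatures come from the
// ReductionUtils hunks above.
#include "mlir/Dialect/GPU/Utils/ReductionUtils.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

static LogicalResult lowerScalarReduceToDPP(gpu::SubgroupReduceOp op,
                                            PatternRewriter &rewriter,
                                            unsigned subgroupSize,
                                            unsigned shuffleBitwidth,
                                            amdgpu::Chipset chipset) {
  // Validate the optional cluster size/stride against the subgroup size.
  FailureOr<ClusterInfo> ci = getAndValidateClusterInfo(op, subgroupSize);
  if (failed(ci))
    return failure();

  Location loc = op.getLoc();
  Type valueTy = op.getValue().getType();
  Type shuffleIntTy = rewriter.getIntegerType(shuffleBitwidth);

  // Pack the value into a shuffle-width integer before the DPP ops and unpack
  // it afterwards. The unpack bitcast appears verbatim in the diff; the pack
  // side is assumed symmetric here.
  auto packFn = [&](Value v) -> Value {
    return rewriter.create<arith::BitcastOp>(loc, shuffleIntTy, v);
  };
  auto unpackFn = [&](Value asInt) -> Value {
    return rewriter.create<arith::BitcastOp>(loc, valueTy, asInt);
  };

  FailureOr<Value> dpp = createSubgroupDPPReduction(
      rewriter, op, op.getValue(), op.getOp(), *ci, chipset, packFn, unpackFn);
  if (failed(dpp))
    return failure();

  rewriter.replaceOp(op, *dpp);
  return success();
}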
Following up on PR #133204, this patch further improves the DPP implementation of the gpu.subgroup_reduce lowering.
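
For context on how the improved patterns are exercised, the TestGpuRewrite.cpp hunk above registers the DPP lowering at a higher pattern benefit than the shuffle-based lowering. A rough sketch of wiring the same populate calls into a downstream pass follows; the wrapper function, the target handling, the subgroup size chosen for the shuffle fallback, and the greedy-driver call are assumptions, and only the populate* calls and their argument order mirror the patch.

// Rough sketch, assuming the usual greedy rewrite driver; only the populate*
// calls and their argument order come from the patch itself.
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static LogicalResult lowerSubgroupReduces(Operation *root, StringRef target) {
  RewritePatternSet patterns(root->getContext());

  // When the AMD chipset is known, prefer the DPP lowering by giving it a
  // higher pattern benefit than the shuffle-based fallback.
  FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(target);
  if (succeeded(maybeChipset)) {
    populateGpuLowerSubgroupReduceToDPPPatterns(
        patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32, *maybeChipset,
        PatternBenefit(2));
    populateGpuLowerClusteredSubgroupReduceToDPPPatterns(
        patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32, *maybeChipset,
        PatternBenefit(2));
  }
  // The shuffle-based lowering stays registered as the default-benefit
  // fallback.
  populateGpuLowerSubgroupReduceToShufflePatterns(
      patterns, /*subgroupSize=*/64, /*shuffleBitwidth=*/32);

  return applyPatternsGreedily(root, std::move(patterns));
}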