[flang] Inline hlfir.matmul[_transpose]. #122821
Conversation
Inlining `hlfir.matmul` as `hlfir.eval_in_mem` does not get rid of a temporary array in many cases, but it may still be much better, allowing us to:

* Get rid of any overhead related to calling the runtime MATMUL (such as descriptor creation).
* Use the CPU-specific vectorization cost model for the matmul loops, which the Fortran runtime currently cannot do.
* Optimize matmul of known-size arrays by complete unrolling.

One of the drawbacks of `hlfir.eval_in_mem` inlining is that the ops inside it with store memory effects block the current MLIR CSE, so I decided to run this inlining late in the pipeline. There is a source comment explaining the CSE issue in more detail.

Straightforward inlining of `hlfir.matmul` as an `hlfir.elemental` is not good for performance, and I got performance regressions with it compared to the Fortran runtime implementation. I put it under an engineering option for experiments. At the same time, inlining `hlfir.matmul_transpose` as `hlfir.elemental` seems to be a good approach, e.g. it allows getting rid of a temporary array in cases like `A(:)=B(:)+MATMUL(TRANSPOSE(C(:,:)),D(:))` (see the sketch below).

This patch improves the performance of galgel and tonto a little bit.
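For illustration, a minimal Fortran sketch of what the fused `hlfir.elemental` inlining of the `MATMUL(TRANSPOSE)` case above roughly corresponds to after bufferization (the names `a`, `b`, `c`, `d`, `n`, `tmp` and the square shape are assumed for the example, not taken from the patch):

```fortran
! A(:) = B(:) + MATMUL(TRANSPOSE(C(:,:)), D(:)) with the inner
! product fused into the elemental addition, so no temporary
! array is materialized for the MATMUL result.
do i = 1, n
  tmp = 0.0
  do k = 1, n
    ! TRANSPOSE(C)(i,k) is C(k,i): K runs over the first
    ! (column-major contiguous) dimension of C.
    tmp = tmp + c(k, i) * d(k)
  end do
  a(i) = b(i) + tmp
end do
```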
@llvm/pr-subscribers-flang-fir-hlfir @llvm/pr-subscribers-flang-driver

Author: Slava Zakharin (vzakhari)

Changes: see the PR description above.

Patch is 86.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122821.diff

10 Files Affected:
diff --git a/flang/include/flang/Optimizer/Builder/FIRBuilder.h b/flang/include/flang/Optimizer/Builder/FIRBuilder.h
index c5d86e713f253a..ea658fb16a36c3 100644
--- a/flang/include/flang/Optimizer/Builder/FIRBuilder.h
+++ b/flang/include/flang/Optimizer/Builder/FIRBuilder.h
@@ -804,6 +804,15 @@ elideLengthsAlreadyInType(mlir::Type type, mlir::ValueRange lenParams);
/// Get the address space which should be used for allocas
uint64_t getAllocaAddressSpace(mlir::DataLayout *dataLayout);
+/// The two vectors of MLIR values have the following property:
+/// \p extents1[i] must have the same value as \p extents2[i]
+/// The function returns a new vector of MLIR values that preserves
+/// the same property vs \p extents1 and \p extents2, but allows
+/// more optimizations. For example, if extents1[j] is a known constant,
+/// and extents2[j] is not, then result[j] is the MLIR value extents1[j].
+llvm::SmallVector<mlir::Value> deduceOptimalExtents(mlir::ValueRange extents1,
+ mlir::ValueRange extents2);
+
} // namespace fir::factory
#endif // FORTRAN_OPTIMIZER_BUILDER_FIRBUILDER_H
diff --git a/flang/include/flang/Optimizer/Builder/HLFIRTools.h b/flang/include/flang/Optimizer/Builder/HLFIRTools.h
index c8aad644bc784a..6e85b8f4ddf86e 100644
--- a/flang/include/flang/Optimizer/Builder/HLFIRTools.h
+++ b/flang/include/flang/Optimizer/Builder/HLFIRTools.h
@@ -508,6 +508,11 @@ genTypeAndKindConvert(mlir::Location loc, fir::FirOpBuilder &builder,
hlfir::Entity source, mlir::Type toType,
bool preserveLowerBounds);
+/// A shortcut for loadTrivialScalar(getElementAt()),
+/// which designates and loads an element of an array.
+Entity loadElementAt(mlir::Location loc, fir::FirOpBuilder &builder,
+ Entity entity, mlir::ValueRange oneBasedIndices);
+
} // namespace hlfir
#endif // FORTRAN_OPTIMIZER_BUILDER_HLFIRTOOLS_H
diff --git a/flang/include/flang/Optimizer/HLFIR/Passes.td b/flang/include/flang/Optimizer/HLFIR/Passes.td
index 644f1e3c3af2b8..90cf6e74241bd0 100644
--- a/flang/include/flang/Optimizer/HLFIR/Passes.td
+++ b/flang/include/flang/Optimizer/HLFIR/Passes.td
@@ -43,6 +43,17 @@ def LowerHLFIROrderedAssignments : Pass<"lower-hlfir-ordered-assignments", "::ml
def SimplifyHLFIRIntrinsics : Pass<"simplify-hlfir-intrinsics"> {
let summary = "Simplify HLFIR intrinsic operations that don't need to result in runtime calls";
+ let options = [Option<"allowNewSideEffects", "allow-new-side-effects", "bool",
+ /*default=*/"false",
+ "If enabled, then the HLFIR operations simplification "
+ "may introduce operations with side effects. "
+ "For example, hlfir.matmul may be inlined as "
+ "and hlfir.eval_in_mem with hlfir.assign inside it."
+ "The hlfir.assign has a write effect on the memory "
+ "argument of hlfir.eval_in_mem, which may block "
+ "some existing MLIR transformations (e.g. CSE) "
+ "that otherwise would have been possible across "
+ "the hlfir.matmul.">];
}
def InlineElementals : Pass<"inline-elementals"> {
diff --git a/flang/lib/Optimizer/Builder/FIRBuilder.cpp b/flang/lib/Optimizer/Builder/FIRBuilder.cpp
index d01becfe800937..218f98ef9ef429 100644
--- a/flang/lib/Optimizer/Builder/FIRBuilder.cpp
+++ b/flang/lib/Optimizer/Builder/FIRBuilder.cpp
@@ -1740,3 +1740,17 @@ uint64_t fir::factory::getAllocaAddressSpace(mlir::DataLayout *dataLayout) {
return mlir::cast<mlir::IntegerAttr>(addrSpace).getUInt();
return 0;
}
+
+llvm::SmallVector<mlir::Value>
+fir::factory::deduceOptimalExtents(mlir::ValueRange extents1,
+ mlir::ValueRange extents2) {
+ llvm::SmallVector<mlir::Value> extents;
+ extents.reserve(extents1.size());
+ for (auto [extent1, extent2] : llvm::zip(extents1, extents2)) {
+ if (!fir::getIntIfConstant(extent1) && fir::getIntIfConstant(extent2))
+ extents.push_back(extent2);
+ else
+ extents.push_back(extent1);
+ }
+ return extents;
+}
diff --git a/flang/lib/Optimizer/Builder/HLFIRTools.cpp b/flang/lib/Optimizer/Builder/HLFIRTools.cpp
index 94238bc24e453d..5e5d0bbd681326 100644
--- a/flang/lib/Optimizer/Builder/HLFIRTools.cpp
+++ b/flang/lib/Optimizer/Builder/HLFIRTools.cpp
@@ -939,8 +939,10 @@ llvm::SmallVector<mlir::Value> hlfir::genLoopNestWithReductions(
doLoop = builder.create<fir::DoLoopOp>(loc, one, ub, one, isUnordered,
/*finalCountValue=*/false,
parentLoop.getRegionIterArgs());
- // Return the results of the child loop from its parent loop.
- builder.create<fir::ResultOp>(loc, doLoop.getResults());
+ if (!reductionInits.empty()) {
+ // Return the results of the child loop from its parent loop.
+ builder.create<fir::ResultOp>(loc, doLoop.getResults());
+ }
}
builder.setInsertionPointToStart(doLoop.getBody());
@@ -955,7 +957,8 @@ llvm::SmallVector<mlir::Value> hlfir::genLoopNestWithReductions(
reductionValues =
genBody(loc, builder, oneBasedIndices, parentLoop.getRegionIterArgs());
builder.setInsertionPointToEnd(parentLoop.getBody());
- builder.create<fir::ResultOp>(loc, reductionValues);
+ if (!reductionValues.empty())
+ builder.create<fir::ResultOp>(loc, reductionValues);
builder.setInsertionPointAfter(outerLoop);
return outerLoop->getResults();
}
@@ -1410,3 +1413,11 @@ void hlfir::computeEvaluateOpIn(mlir::Location loc, fir::FirOpBuilder &builder,
builder.clone(op, mapper);
return;
}
+
+hlfir::Entity hlfir::loadElementAt(mlir::Location loc,
+ fir::FirOpBuilder &builder,
+ hlfir::Entity entity,
+ mlir::ValueRange oneBasedIndices) {
+ return loadTrivialScalar(loc, builder,
+ getElementAt(loc, builder, entity, oneBasedIndices));
+}
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp b/flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp
index 314ced8679521a..0fd535b4290799 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp
@@ -28,6 +28,13 @@ namespace hlfir {
#include "flang/Optimizer/HLFIR/Passes.h.inc"
} // namespace hlfir
+#define DEBUG_TYPE "simplify-hlfir-intrinsics"
+
+static llvm::cl::opt<bool> forceMatmulAsElemental(
+ "flang-inline-matmul-as-elemental",
+ llvm::cl::desc("Expand hlfir.matmul as elemental operation"),
+ llvm::cl::init(false));
+
namespace {
class TransposeAsElementalConversion
@@ -467,9 +474,438 @@ class CShiftAsElementalConversion
}
};
+template <typename Op>
+class MatmulConversion : public mlir::OpRewritePattern<Op> {
+public:
+ using mlir::OpRewritePattern<Op>::OpRewritePattern;
+
+ llvm::LogicalResult
+ matchAndRewrite(Op matmul, mlir::PatternRewriter &rewriter) const override {
+ mlir::Location loc = matmul.getLoc();
+ fir::FirOpBuilder builder{rewriter, matmul.getOperation()};
+ hlfir::Entity lhs = hlfir::Entity{matmul.getLhs()};
+ hlfir::Entity rhs = hlfir::Entity{matmul.getRhs()};
+ mlir::Value resultShape, innerProductExtent;
+ std::tie(resultShape, innerProductExtent) =
+ genResultShape(loc, builder, lhs, rhs);
+
+ if (forceMatmulAsElemental || isMatmulTranspose) {
+ // Generate hlfir.elemental that produces the result of
+ // MATMUL/MATMUL(TRANSPOSE).
+ // Note that this implementation is very suboptimal for MATMUL,
+ // but is quite good for MATMUL(TRANSPOSE), e.g.:
+ // R(1:N) = R(1:N) + MATMUL(TRANSPOSE(X(1:N,1:N)), Y(1:N))
+ // Inlining MATMUL(TRANSPOSE) as hlfir.elemental may result
+ // in merging the inner product computation with the elemental
+ // addition. Note that the inner product computation will
+ // benefit from processing the lowermost dimensions of X and Y,
+ // which may be the best when they are contiguous.
+ //
+ // This is why we always inline MATMUL(TRANSPOSE) as an elemental.
+ // MATMUL is inlined below by default unless forceMatmulAsElemental.
+ hlfir::ExprType resultType =
+ mlir::cast<hlfir::ExprType>(matmul.getType());
+ hlfir::ElementalOp newOp = genElementalMatmul(
+ loc, builder, resultType, resultShape, lhs, rhs, innerProductExtent);
+ rewriter.replaceOp(matmul, newOp);
+ return mlir::success();
+ }
+
+ // Generate hlfir.eval_in_mem to mimic the MATMUL implementation
+ // from Fortran runtime. The implementation needs to operate
+ // with the result array as an in-memory object.
+ hlfir::EvaluateInMemoryOp evalOp =
+ builder.create<hlfir::EvaluateInMemoryOp>(
+ loc, mlir::cast<hlfir::ExprType>(matmul.getType()), resultShape);
+ builder.setInsertionPointToStart(&evalOp.getBody().front());
+
+ // Embox the raw array pointer to simplify designating it.
+ // TODO: this currently results in redundant lower bounds
+ // addition for the designator, but this should be fixed in
+ // hlfir::Entity::mayHaveNonDefaultLowerBounds().
+ mlir::Value resultArray = evalOp.getMemory();
+ mlir::Type arrayType = fir::dyn_cast_ptrEleTy(resultArray.getType());
+ resultArray = builder.createBox(loc, fir::BoxType::get(arrayType),
+ resultArray, resultShape, /*slice=*/nullptr,
+ /*lengths=*/{}, /*tdesc=*/nullptr);
+
+ // The contiguous MATMUL version is best for the cases
+ // where the input arrays and (maybe) the result are contiguous
+ // in their lowermost dimensions.
+ // Especially, when LLVM can recognize the continuity
+ // and vectorize the loops properly.
+ // TODO: we need to recognize the cases when the continuity
+ // is not statically obvious and try to generate an explicitly
+ // continuous version under a dynamic check. The fallback
+ // implementation may use genElementalMatmul() with
+ // an hlfir.assign into the result of eval_in_mem.
+ mlir::LogicalResult rewriteResult =
+ genContiguousMatmul(loc, builder, hlfir::Entity{resultArray},
+ resultShape, lhs, rhs, innerProductExtent);
+
+ if (mlir::failed(rewriteResult)) {
+ // Erase the unclaimed eval_in_mem op.
+ rewriter.eraseOp(evalOp);
+ return rewriter.notifyMatchFailure(matmul,
+ "genContiguousMatmul() failed");
+ }
+
+ rewriter.replaceOp(matmul, evalOp);
+ return mlir::success();
+ }
+
+private:
+ static constexpr bool isMatmulTranspose =
+ std::is_same_v<Op, hlfir::MatmulTransposeOp>;
+
+ // Return a tuple of:
+ // * A fir.shape operation representing the shape of the result
+ // of a MATMUL/MATMUL(TRANSPOSE).
+ // * An extent of the dimensions of the input array
+ // that are processed during the inner product computation.
+ static std::tuple<mlir::Value, mlir::Value>
+ genResultShape(mlir::Location loc, fir::FirOpBuilder &builder,
+ hlfir::Entity input1, hlfir::Entity input2) {
+ mlir::Value input1Shape = hlfir::genShape(loc, builder, input1);
+ llvm::SmallVector<mlir::Value> input1Extents =
+ hlfir::getExplicitExtentsFromShape(input1Shape, builder);
+ if (input1Shape.getUses().empty())
+ input1Shape.getDefiningOp()->erase();
+ mlir::Value input2Shape = hlfir::genShape(loc, builder, input2);
+ llvm::SmallVector<mlir::Value> input2Extents =
+ hlfir::getExplicitExtentsFromShape(input2Shape, builder);
+ if (input2Shape.getUses().empty())
+ input2Shape.getDefiningOp()->erase();
+
+ llvm::SmallVector<mlir::Value, 2> newExtents;
+ mlir::Value innerProduct1Extent, innerProduct2Extent;
+ if (input1Extents.size() == 1) {
+ assert(!isMatmulTranspose &&
+ "hlfir.matmul_transpose's first operand must be rank-2 array");
+ assert(input2Extents.size() == 2 &&
+ "hlfir.matmul second argument must be rank-2 array");
+ newExtents.push_back(input2Extents[1]);
+ innerProduct1Extent = input1Extents[0];
+ innerProduct2Extent = input2Extents[0];
+ } else {
+ if (input2Extents.size() == 1) {
+ assert(input1Extents.size() == 2 &&
+ "hlfir.matmul first argument must be rank-2 array");
+ if constexpr (isMatmulTranspose)
+ newExtents.push_back(input1Extents[1]);
+ else
+ newExtents.push_back(input1Extents[0]);
+ } else {
+ assert(input1Extents.size() == 2 && input2Extents.size() == 2 &&
+ "hlfir.matmul arguments must be rank-2 arrays");
+ if constexpr (isMatmulTranspose)
+ newExtents.push_back(input1Extents[1]);
+ else
+ newExtents.push_back(input1Extents[0]);
+
+ newExtents.push_back(input2Extents[1]);
+ }
+ if constexpr (isMatmulTranspose)
+ innerProduct1Extent = input1Extents[0];
+ else
+ innerProduct1Extent = input1Extents[1];
+
+ innerProduct2Extent = input2Extents[0];
+ }
+ // The inner product dimensions of the input arrays
+ // must match. Pick the best (e.g. constant) out of them
+ // so that the inner product loop bound can be used in
+ // optimizations.
+ llvm::SmallVector<mlir::Value> innerProductExtent =
+ fir::factory::deduceOptimalExtents({innerProduct1Extent},
+ {innerProduct2Extent});
+ return {builder.create<fir::ShapeOp>(loc, newExtents),
+ innerProductExtent[0]};
+ }
+
+ static mlir::Value castToProductType(mlir::Location loc,
+ fir::FirOpBuilder &builder,
+ mlir::Value value, mlir::Type type) {
+ if (mlir::isa<fir::LogicalType>(type))
+ return builder.createConvert(loc, builder.getIntegerType(1), value);
+
+ if (fir::isa_complex(type) && !fir::isa_complex(value.getType())) {
+ mlir::Value zeroCmplx = fir::factory::createZeroValue(builder, loc, type);
+ fir::factory::Complex helper(builder, loc);
+ mlir::Type partType = helper.getComplexPartType(type);
+ return helper.insertComplexPart(
+ zeroCmplx, castToProductType(loc, builder, value, partType),
+ /*isImagPart=*/false);
+ }
+ return builder.createConvert(loc, type, value);
+ }
+
+ // Generate an update of the inner product value:
+ // acc += v1 * v2, OR
+ // acc ||= v1 && v2
+ static mlir::Value genAccumulateProduct(mlir::Location loc,
+ fir::FirOpBuilder &builder,
+ mlir::Type resultType,
+ mlir::Value acc, mlir::Value v1,
+ mlir::Value v2) {
+ acc = castToProductType(loc, builder, acc, resultType);
+ v1 = castToProductType(loc, builder, v1, resultType);
+ v2 = castToProductType(loc, builder, v2, resultType);
+ mlir::Value result;
+ if (mlir::isa<mlir::FloatType>(resultType))
+ result = builder.create<mlir::arith::AddFOp>(
+ loc, acc, builder.create<mlir::arith::MulFOp>(loc, v1, v2));
+ else if (mlir::isa<mlir::ComplexType>(resultType))
+ result = builder.create<fir::AddcOp>(
+ loc, acc, builder.create<fir::MulcOp>(loc, v1, v2));
+ else if (mlir::isa<mlir::IntegerType>(resultType))
+ result = builder.create<mlir::arith::AddIOp>(
+ loc, acc, builder.create<mlir::arith::MulIOp>(loc, v1, v2));
+ else if (mlir::isa<fir::LogicalType>(resultType))
+ result = builder.create<mlir::arith::OrIOp>(
+ loc, acc, builder.create<mlir::arith::AndIOp>(loc, v1, v2));
+ else
+ llvm_unreachable("unsupported type");
+
+ return builder.createConvert(loc, resultType, result);
+ }
+
+ static mlir::LogicalResult
+ genContiguousMatmul(mlir::Location loc, fir::FirOpBuilder &builder,
+ hlfir::Entity result, mlir::Value resultShape,
+ hlfir::Entity lhs, hlfir::Entity rhs,
+ mlir::Value innerProductExtent) {
+ // This code does not support MATMUL(TRANSPOSE), and it is supposed
+ // to be inlined as hlfir.elemental.
+ if constexpr (isMatmulTranspose)
+ return mlir::failure();
+
+ mlir::OpBuilder::InsertionGuard guard(builder);
+ mlir::Type resultElementType = result.getFortranElementType();
+ llvm::SmallVector<mlir::Value, 2> resultExtents =
+ mlir::cast<fir::ShapeOp>(resultShape.getDefiningOp()).getExtents();
+
+ // The inner product loop may be unordered if FastMathFlags::reassoc
+ // transformations are allowed. The integer/logical inner product is
+ // always unordered.
+ // Note that isUnordered is currently applied to all loops
+ // in the loop nests generated below, while it has to be applied
+ // only to one.
+ bool isUnordered = mlir::isa<mlir::IntegerType>(resultElementType) ||
+ mlir::isa<fir::LogicalType>(resultElementType) ||
+ static_cast<bool>(builder.getFastMathFlags() &
+ mlir::arith::FastMathFlags::reassoc);
+
+ // Insert the initialization loop nest that fills the whole result with
+ // zeroes.
+ mlir::Value initValue =
+ fir::factory::createZeroValue(builder, loc, resultElementType);
+ auto genInitBody = [&](mlir::Location loc, fir::FirOpBuilder &builder,
+ mlir::ValueRange oneBasedIndices,
+ mlir::ValueRange reductionArgs)
+ -> llvm::SmallVector<mlir::Value, 0> {
+ hlfir::Entity resultElement =
+ hlfir::getElementAt(loc, builder, result, oneBasedIndices);
+ // builder.create<fir::StoreOp>(loc, initValue, resultElement);
+ builder.create<hlfir::AssignOp>(loc, initValue, resultElement);
+ return {};
+ };
+
+ hlfir::genLoopNestWithReductions(loc, builder, resultExtents,
+ /*reductionInits=*/{}, genInitBody,
+ /*isUnordered=*/true);
+
+ if (lhs.getRank() == 2 && rhs.getRank() == 2) {
+ // LHS(NROWS,N) * RHS(N,NCOLS) -> RESULT(NROWS,NCOLS)
+ //
+ // Insert the computation loop nest:
+ // DO 2 K = 1, N
+ // DO 2 J = 1, NCOLS
+ // DO 2 I = 1, NROWS
+ // 2 RESULT(I,J) = RESULT(I,J) + LHS(I,K)*RHS(K,J)
+ auto genMatrixMatrix = [&](mlir::Location loc, fir::FirOpBuilder &builder,
+ mlir::ValueRange oneBasedIndices,
+ mlir::ValueRange reductionArgs)
+ -> llvm::SmallVector<mlir::Value, 0> {
+ mlir::Value I = oneBasedIndices[0];
+ mlir::Value J = oneBasedIndices[1];
+ mlir::Value K = oneBasedIndices[2];
+ hlfir::Entity resultElement =
+ hlfir::getElementAt(loc, builder, result, {I, J});
+ hlfir::Entity resultElementValue =
+ hlfir::loadTrivialScalar(loc, builder, resultElement);
+ hlfir::Entity lhsElementValue =
+ hlfir::loadElementAt(loc, builder, lhs, {I, K});
+ hlfir::Entity rhsElementValue =
+ hlfir::loadElementAt(loc, builder, rhs, {K, J});
+ mlir::Value productValue = genAccumulateProduct(
+ loc, builder, resultElementType, resultElementValue,
+ lhsElementValue, rhsElementValue);
+ builder.create<hlfir::AssignOp>(loc, productValue, resultElement);
+ // builder.create<fir::StoreOp>(loc, productValue,
+ // resultElement);
+ return {};
+ };
+
+ // Note that the loops are inserted in reverse order,
+ // so innerProductExtent should be passed as the last extent.
+ hlfir::genLoopNestWithReductions(
+ loc, builder,
+ {resultExtents[0], resultExtents[1], innerProductExtent},
+ /*reductionInits=*/{}, genMatrixMatrix, isUnordered);
+ return mlir::success();
+ }
+
+ if (lhs.getRank() == 2 && rhs.getRank() == 1) {
+ // LHS(NROWS,N) * RHS(N) -> RESULT(NROWS)
+ //
+ ...
[truncated]
Thank you Slava, that is an interesting usage of hlfir.eval_in_mem!
I wonder if we could improve the array access order for the hlfir.elemental case in bufferization by re-ordering the loops, but that does not sound trivial.
Please wait for Tom's feedback on ARM performance.
return builder.createConvert(loc, builder.getIntegerType(1), value);

if (fir::isa_complex(type) && !fir::isa_complex(value.getType())) {
  mlir::Value zeroCmplx = fir::factory::createZeroValue(builder, loc, type);
Isn't it beneficial to special-case `complex * real`? Maybe the multiplication by zero is optimized out later, though.
I am not asking you to do it, just curious. If it would be better to special-case it, you can add a TODO note.
Yes, `InstCombinePass` in LLVM gets rid of the multiplications by zero under `-fno-signed-zeros -fno-honor-nans`. I guess we can special case it here and avoid them by default:
- we can do it by expanding the complex/real multiplication with packing/repacking,
- or by setting `nnan nsz` on such mulc/addc operations.

I will add a TODO note. Thanks!
Why doesn't LLVM do this by default? I can't see a precision issue.
That is because we cannot replace `X * 0.0` with `0.0` if `X` might be NaN or a signed zero.
That is in general, but in the case of `complex * real` it should be correct.
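For concreteness, an illustrative expansion (not code from the patch): widening the real operand `r` to complex gives `(re, im) * (r, 0.0) = (re*r - im*0.0, im*r + re*0.0)`, while expanding `complex * real` directly needs only `(re*r, im*r)`. The `im*0.0` / `re*0.0` terms are exactly what InstCombine can only fold away once it may assume no NaNs and no signed zeros, which is why the special case (or the `nnan nsz` flags) would avoid them by default.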
// an hlfir.assign into the result of eval_in_mem.
mlir::LogicalResult rewriteResult =
    genContiguousMatmul(loc, builder, hlfir::Entity{resultArray},
                        resultShape, lhs, rhs, innerProductExtent);
I am not sure I get the comment here.
Is this generating code that is optimal for the cases where the input arrays are contiguous in the inner dimension but is always valid, or is this generating code that is only valid when the input arrays are contiguous in the inner dimension?
I do not see a contiguity check, so I imagine this is the former, but it is not really clear to me from the comment and name that it is still correct if the input is not contiguous.
Yep, it is the former :) I will update the comment.
Basically, we want to get from the straightforward per-element loop order to the runtime-like order with the inner-product loop outermost (a sketch follows below).
It seems to me that it is too much to do in hlfir.elemental bufferization :) Another point to investigate is what we should do for the parallel execution models. The straightforward implementation may be parallelized across rows and columns, but the accesses will be sparse for one of the operands. That is why I would like to keep the possibility to inline it both ways and continue experimenting.
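For reference, an illustrative Fortran sketch of the loop-order contrast discussed here (not the exact snippets from the thread; `lhs(nrows, n)`, `rhs(n, ncols)`, and `result(nrows, ncols)` are assumed):

```fortran
! Straightforward (elemental) order: one inner product per result
! element; lhs(i, k) is accessed with stride NROWS in the
! innermost K loop.
do j = 1, ncols
  do i = 1, nrows
    result(i, j) = 0.0
    do k = 1, n
      result(i, j) = result(i, j) + lhs(i, k) * rhs(k, j)
    end do
  end do
end do

! Runtime-like order produced by genContiguousMatmul(): a separate
! zero-initialization pass, then the inner-product (K) loop
! outermost, so the innermost I loop walks contiguous
! (column-major) columns of result and lhs.
result = 0.0
do k = 1, n
  do j = 1, ncols
    do i = 1, nrows
      result(i, j) = result(i, j) + lhs(i, k) * rhs(k, j)
    end do
  end do
end do
```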
Thanks for the details, this makes total sense to me.
Thanks Slava, this looks great.
I don't see any performance regressions on aarch64.
The tests started failing at -O2 after llvm/llvm-project#122821. Flang inlines hlfir.matmul[_transpose] after the change, so the runtime is not called, and it does not perform the bounds checks. Related feature request: https://github.com/orgs/llvm/projects/12?pane=issue&itemId=29048733
This seems to be breaking the gfortran testsuite: https://lab.llvm.org/buildbot/#/builders/198/builds/1165
Thanks, I am fixing it in llvm/llvm-test-suite#201