[WIP][flang][OpenMP] Experimental pass to map do concurrent to OMP #77285


Closed
wants to merge 2 commits

Conversation

@ergawy (Member) commented Jan 8, 2024

Adds a pass to map do concurrent to OpenMP worksharing constructs. For now, it only maps basic loops to omp parallel do. This is still a WIP; more work is needed for testing and for mapping more advanced loops.

So far, the pass is not added to any flang pipelines; it lives only as a standalone pass tested via fir-opt. Once we agree on the approach and the pass matures, I will add it to flang.
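As an illustrative sketch (not part of the patch itself), the mapping treats a basic loop such as:

    do concurrent (i = 1:n)
      a(i) = i
    end do

as if the user had written:

    !$omp parallel do
    do i = 1, n
      a(i) = i
    end do
    !$omp end parallel do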

@ergawy self-assigned this on Jan 8, 2024
@llvmbot added the flang (Flang issues not falling into any other category) and flang:fir-hlfir labels on Jan 8, 2024
@ergawy requested a review from tblah on January 8, 2024
@llvmbot (Member) commented Jan 8, 2024

@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes

Adds a pass to map do concurrent to OpenMP worksharing constructs. For now, it only maps basic loops to omp parallel do. This is still a WIP; more work is needed for testing and for mapping more advanced loops.


Full diff: https://github.com/llvm/llvm-project/pull/77285.diff

6 Files Affected:

  • (modified) flang/include/flang/Optimizer/HLFIR/HLFIROps.td (+1-1)
  • (modified) flang/include/flang/Optimizer/Transforms/Passes.h (+2)
  • (modified) flang/include/flang/Optimizer/Transforms/Passes.td (+20)
  • (modified) flang/lib/Optimizer/Transforms/CMakeLists.txt (+1)
  • (added) flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp (+162)
  • (added) flang/test/Transforms/DoConcurrent/basic.mlir (+60)
diff --git a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
index 254771ca1780cf..718026e4fdfbe7 100644
--- a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
+++ b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
@@ -75,7 +75,7 @@ def hlfir_DeclareOp : hlfir_Op<"declare", [AttrSizedOperandSegments,
     func.func @foo(%arg0: !fir.ref<!fir.array<?x?x!fir.char<1,?>>>, %arg1: !fir.ref<i64>) {
       %c10 = arith.constant 10 : index
       %c20 = arith.constant 20 : index
-      %1 = fir.load %ag1 : fir.ref<i64>
+      %1 = fir.load %arg1 : fir.ref<i64>
       %2 = fir.shape_shift %c10, %1, %c20, %1 : (index, index, index, index) -> !fir.shapeshift<2>
       %3 = hfir.declare %arg0(%2) typeparams %1 {uniq_name = "c"} (fir.ref<!fir.array<?x?x!fir.char<1,?>>>, fir.shapeshift<2>, index) -> (fir.box<!fir.array<?x?x!fir.char<1,?>>>, fir.ref<!fir.array<?x?x!fir.char<1,?>>>)
       // ... uses %3#0 as "c"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.h b/flang/include/flang/Optimizer/Transforms/Passes.h
index 6970da8698ae84..1f389a1f571b75 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.h
+++ b/flang/include/flang/Optimizer/Transforms/Passes.h
@@ -93,6 +93,8 @@ std::unique_ptr<mlir::Pass> createFunctionAttrPass();
 std::unique_ptr<mlir::Pass>
 createFunctionAttrPass(FunctionAttrTypes &functionAttr);
 
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass();
+
 // declarative passes
 #define GEN_PASS_REGISTRATION
 #include "flang/Optimizer/Transforms/Passes.h.inc"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.td b/flang/include/flang/Optimizer/Transforms/Passes.td
index e3c45d41f04cc7..41d9be4249b5e2 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.td
+++ b/flang/include/flang/Optimizer/Transforms/Passes.td
@@ -370,4 +370,24 @@ def FunctionAttr : Pass<"function-attr", "mlir::func::FuncOp"> {
   let constructor = "::fir::createFunctionAttrPass()";
 }
 
+def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENR` loops
+     to their correspnding equivalent OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+       - Mapping to `target teams distribute parallel do`.
+       - Allowing the user to control mapping behavior: either to the host or
+         target.
+  }];
+
+  let constructor = "::fir::createDoConcurrentConversionPass()";
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+}
+
 #endif // FLANG_OPTIMIZER_TRANSFORMS_PASSES
diff --git a/flang/lib/Optimizer/Transforms/CMakeLists.txt b/flang/lib/Optimizer/Transforms/CMakeLists.txt
index fc067ad3585395..13a46ef7422530 100644
--- a/flang/lib/Optimizer/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/Transforms/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FIRTransforms
   OMPMarkDeclareTarget.cpp
   VScaleAttr.cpp
   FunctionAttr.cpp
+  DoConcurrentConversion.cpp
 
   DEPENDS
   FIRDialect
diff --git a/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
new file mode 100644
index 00000000000000..180c0bdf672af9
--- /dev/null
+++ b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
@@ -0,0 +1,162 @@
+//===- DoConcurrentConversion.cpp -- map `DO CONCURRENT` to OpenMP loops --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Optimizer/Dialect/FIRDialect.h"
+#include "flang/Optimizer/Dialect/FIROps.h"
+#include "flang/Optimizer/Dialect/FIRType.h"
+#include "flang/Optimizer/Dialect/Support/FIRContext.h"
+#include "flang/Optimizer/HLFIR/HLFIRDialect.h"
+#include "flang/Optimizer/Transforms/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/Diagnostics.h"
+#include "mlir/IR/IRMapping.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include <memory>
+
+namespace fir {
+#define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
+#include "flang/Optimizer/Transforms/Passes.h.inc"
+} // namespace fir
+
+#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+
+namespace {
+class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
+public:
+  using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
+
+  mlir::LogicalResult
+  matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
+                  mlir::ConversionPatternRewriter &rewriter) const override {
+    mlir::OpPrintingFlags flags;
+    flags.printGenericOpForm();
+
+    mlir::omp::ParallelOp parallelOp =
+        rewriter.create<mlir::omp::ParallelOp>(doLoop.getLoc());
+
+    rewriter.createBlock(&parallelOp.getRegion());
+    mlir::Block &block = parallelOp.getRegion().back();
+
+    rewriter.setInsertionPointToEnd(&block);
+    rewriter.create<mlir::omp::TerminatorOp>(doLoop.getLoc());
+
+    rewriter.setInsertionPointToStart(&block);
+
+    // Clone the LB, UB, step defining ops inside the parallel region.
+    llvm::SmallVector<mlir::Value> lowerBound, upperBound, step;
+    lowerBound.push_back(
+        rewriter.clone(*doLoop.getLowerBound().getDefiningOp())->getResult(0));
+    upperBound.push_back(
+        rewriter.clone(*doLoop.getUpperBound().getDefiningOp())->getResult(0));
+    step.push_back(
+        rewriter.clone(*doLoop.getStep().getDefiningOp())->getResult(0));
+
+    auto wsLoopOp = rewriter.create<mlir::omp::WsLoopOp>(
+        doLoop.getLoc(), lowerBound, upperBound, step);
+    wsLoopOp.setInclusive(true);
+
+    auto outlineableOp =
+        mlir::dyn_cast<mlir::omp::OutlineableOpenMPOpInterface>(*parallelOp);
+    assert(outlineableOp);
+    rewriter.setInsertionPointToStart(outlineableOp.getAllocaBlock());
+
+    // For the induction variable, we need to privatize its allocation and
+    // binding inside the parallel region.
+    llvm::SmallSetVector<mlir::Operation *, 2> workList;
+    // Therefore, we first find the induction variable by looking for
+    // `fir.store`s whose source is the loop's block argument.
+    workList.insert(doLoop.getInductionVar().getUsers().begin(),
+                    doLoop.getInductionVar().getUsers().end());
+    llvm::SmallSetVector<fir::StoreOp, 2> inductionVarTargetStores;
+
+    // Walk the use-chain of the loop's block argument until we hit a `fir.store`.
+    while (!workList.empty()) {
+      mlir::Operation *item = workList.front();
+
+      if (auto storeOp = mlir::dyn_cast<fir::StoreOp>(item)) {
+        inductionVarTargetStores.insert(storeOp);
+      } else {
+        workList.insert(item->getUsers().begin(), item->getUsers().end());
+      }
+
+      workList.remove(item);
+    }
+
+    // For each collected `fir.store`, find the target memref's alloca and
+    // declare ops.
+    llvm::SmallSetVector<mlir::Operation *, 4> declareAndAllocasToClone;
+    for (auto storeOp : inductionVarTargetStores) {
+      mlir::Operation *storeTarget = storeOp.getMemref().getDefiningOp();
+
+      for (auto operand : storeTarget->getOperands()) {
+        declareAndAllocasToClone.insert(operand.getDefiningOp());
+      }
+      declareAndAllocasToClone.insert(storeTarget);
+    }
+
+    mlir::IRMapping mapper;
+
+    // Collect the memref defining ops in the parallel region.
+    for (mlir::Operation *opToClone : declareAndAllocasToClone) {
+      rewriter.clone(*opToClone, mapper);
+    }
+
+    // Clone the loop's body inside the worksharing construct using the mapped
+    // memref values.
+    rewriter.cloneRegionBefore(doLoop.getRegion(), wsLoopOp.getRegion(),
+                               wsLoopOp.getRegion().begin(), mapper);
+
+    mlir::Operation *terminator = wsLoopOp.getRegion().back().getTerminator();
+    rewriter.setInsertionPointToEnd(&wsLoopOp.getRegion().back());
+    rewriter.create<mlir::omp::YieldOp>(terminator->getLoc());
+    rewriter.eraseOp(terminator);
+
+    rewriter.eraseOp(doLoop);
+
+    return mlir::success();
+  }
+};
+
+class DoConcurrentConversionPass
+    : public fir::impl::DoConcurrentConversionPassBase<
+          DoConcurrentConversionPass> {
+public:
+  void runOnOperation() override {
+    mlir::func::FuncOp func = getOperation();
+
+    if (func.isDeclaration()) {
+      return;
+    }
+
+    auto *context = &getContext();
+    mlir::RewritePatternSet patterns(context);
+    patterns.insert<DoConcurrentConversion>(context);
+    mlir::ConversionTarget target(*context);
+    target.addLegalDialect<fir::FIROpsDialect, hlfir::hlfirDialect,
+                           mlir::arith::ArithDialect, mlir::func::FuncDialect,
+                           mlir::omp::OpenMPDialect>();
+
+    target.addDynamicallyLegalOp<fir::DoLoopOp>(
+        [](fir::DoLoopOp op) { return !op.getUnordered(); });
+
+    if (mlir::failed(mlir::applyFullConversion(getOperation(), target,
+                                               std::move(patterns)))) {
+      mlir::emitError(mlir::UnknownLoc::get(context),
+                      "error in converting do-concurrent op");
+      signalPassFailure();
+    }
+  }
+};
+} // namespace
+
+std::unique_ptr<mlir::Pass> fir::createDoConcurrentConversionPass() {
+  return std::make_unique<DoConcurrentConversionPass>();
+}
diff --git a/flang/test/Transforms/DoConcurrent/basic.mlir b/flang/test/Transforms/DoConcurrent/basic.mlir
new file mode 100644
index 00000000000000..b4a2ced45950e5
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic.mlir
@@ -0,0 +1,60 @@
+// Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+// RUN: fir-opt --fopenmp-do-concurrent-conversion %s | FileCheck %s
+
+// CHECK-LABEL: func.func @do_concurrent_basic
+func.func @do_concurrent_basic() attributes {fir.bindc_name = "do_concurrent_basic"} {
+    // CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+    // CHECK: %[[C1:.*]] = arith.constant 1 : i32
+    // CHECK: %[[C10:.*]] = arith.constant 10 : i32
+
+    %0 = fir.alloca i32 {bindc_name = "i"}
+    %1:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+    %2 = fir.address_of(@_QFEa) : !fir.ref<!fir.array<10xi32>>
+    %c10 = arith.constant 10 : index
+    %3 = fir.shape %c10 : (index) -> !fir.shape<1>
+    %4:2 = hlfir.declare %2(%3) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+    %c1_i32 = arith.constant 1 : i32
+    %7 = fir.convert %c1_i32 : (i32) -> index
+    %c10_i32 = arith.constant 10 : i32
+    %8 = fir.convert %c10_i32 : (i32) -> index
+    %c1 = arith.constant 1 : index
+
+    // CHECK-NOT: fir.do_loop
+
+    // CHECK: omp.parallel {
+
+    // CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+    // CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+    // CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+    // CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+    // CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+     // CHECK: omp.wsloop for (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+     // CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+     // CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+     // CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]])  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+     // CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+     // CHECK-NEXT: omp.yield
+     // CHECK-NEXT: }
+
+     // CHECK-NEXT: omp.terminator
+     // CHECK-NEXT: }
+    fir.do_loop %arg0 = %7 to %8 step %c1 unordered {
+      %13 = fir.convert %arg0 : (index) -> i32
+      fir.store %13 to %1#1 : !fir.ref<i32>
+      %14 = fir.load %1#0 : !fir.ref<i32>
+      %15 = fir.load %1#0 : !fir.ref<i32>
+      %16 = fir.convert %15 : (i32) -> i64
+      %17 = hlfir.designate %4#0 (%16)  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+      hlfir.assign %14 to %17 : i32, !fir.ref<i32>
+    }
+
+    // CHECK-NOT: fir.do_loop
+
+    return
+  }

@kiranchandramohan (Contributor) commented

Nice start.

In case you have missed it, there are some recommendations for do concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md.

I guess the things to consider are:

  1. Is any additional analysis necessary before this conversion?
  2. How to map locality constraints? I believe they are currently handled in lowering; this will need to be delayed.
  3. Ensure this is only enabled with driver flags.

@ergawy (Member, Author) commented Jan 9, 2024

Thanks @kiranchandramohan for looking at this!

Nice start.

In case you have missed it, there are some recommendations for do concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md.

I guess the things to consider are:

1. Is any additional analysis necessary before this conversion?

I also have the feeling that some analyses will be needed, but I am not sure which ones yet. Maybe a starting point would be to verify that input loops respect the constraints listed by the spec, for example C1121-C1137. Some of these constraints are already enforced by the front-end before lowering, so we will have to consider them one by one and decide the best place to check each. There is also section 11.1.7.5 "Additional semantics for DO CONCURRENT constructs" in the spec, which I think can be investigated for further analyses before we apply the mapping to OMP. Do you have any particular analyses in mind beyond this?
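For example, here is a sketch of one restriction in that range (paraphrased from F2018; the exact constraint number is worth double-checking against the spec):

    do concurrent (i = 1:n)
      if (a(i) < 0) return   ! not conformant: RETURN (like EXIT or a branch
    end do                   ! out of the construct) is prohibited here

The front-end already rejects violations like this one, which is why each constraint needs to be considered individually.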

2. How to map locality constraints? I believe they are currently handled in lowering; this will need to be delayed.

Indeed this is currently handled in lowering. Below is an example without local_init and then the same loop with local_init:

Without local_init this is the loop directly after lowering:
    // do concurrent (integer :: j=i:10)
    // end do
    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
      %10 = fir.convert %arg0 : (index) -> i32
      fir.store %10 to %1#1 : !fir.ref<i32>
    }
With local_init this is the loop directly after lowering:
    //  do concurrent (integer :: j=i:10) local_init(i)
    //  end do
    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
      %10 = fir.convert %arg0 : (index) -> i32
      fir.store %10 to %1#1 : !fir.ref<i32>
      %11 = fir.alloca i32 {bindc_name = "i", pinned, uniq_name = "_QFEi"}
      %12:2 = hlfir.declare %11 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
      %13 = fir.load %6#0 : !fir.ref<i32>
      hlfir.assign %13 to %12#0 : i32, !fir.ref<i32>
    }
So, one possibility I think would be:
  • Extract the omp.map_info op to some shared location between the fir and omp dialects.
  • Use map_info to model locality constraints for do concurrent as well.

Do you see any blockers for such an approach?
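For reference, a sketch of the locality specifiers such a mapping would eventually need to model (illustrative only; tmp, scale, a, and n are assumed declarations), with their closest OpenMP analogues noted in the comments:

    do concurrent (i = 1:n) local(tmp) local_init(scale) shared(a)
      tmp = a(i) * scale   ! local(tmp): undefined on entry, ~ OpenMP private
      a(i) = tmp           ! local_init(scale): copied in, ~ OpenMP firstprivate
    end do                 ! shared(a): one instance, ~ OpenMP shared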

3. Ensure this is only enabled with driver flags.

Will do! For now, I suggest keeping it as a standalone pass until we flesh it out a bit more. Please let me know if you disagree.


In addition to what you mentioned, I think point number 4 would be how to model reductions. That is for the future, though, after the above points are addressed.


With that in mind, do you mind if we review and merge the current PR as is and defer further development to later PRs? I think doing all of this in one PR would not be ideal.

@kiranchandramohan (Contributor) commented

Thanks @ergawy for the reply.

Since there are some strong opinions expressed about Do Concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md, and since there are several possible lowerings (sequential, CPU-OpenMP, GPU-OpenMP, CPU-SIMD, GPU-OpenACC), it might be good to agree on the design first and then proceed. The way to do this would be to write a design doc, submit a PR to the flang/docs directory, and advertise it on the Flang Slack and Discourse. In particular, we should be clear about the analysis required. I have not read the relevant sections of the standard closely enough to reply to your question above; doesn't the document cover this?

@ergawy (Member, Author) commented Jan 9, 2024

Thanks @ergawy for the reply.

Since there are some strong opinions expressed about Do Concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md, and since there are several possible lowerings (sequential, CPU-OpenMP, GPU-OpenMP, CPU-SIMD, GPU-OpenACC), it might be good to agree on the design first and then proceed. The way to do this would be to write a design doc, submit a PR to the flang/docs directory, and advertise it on the Flang Slack and Discourse. In particular, we should be clear about the analysis required. I have not read the relevant sections of the standard closely enough to reply to your question above; doesn't the document cover this?

Sure thing! I will start working on a design doc and open a separate (parent) PR for it.

@ergawy requested a review from klausler on January 10, 2024
@klausler (Contributor) commented

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

@ergawy (Member, Author) commented Jan 11, 2024

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

Thanks for your reply. Yup, I am looking into understanding the inherent issues in do concurrent at the moment. I spent some time going over the discussions on GitHub and the J3 forums (not all involving you directly, but relevant to the issues you raised) ([1], [2], and the linked papers and documents), trying to understand the issues in more depth. Once I have formed some more background, I will reach out to you, if you don't mind, since you have thought about this problem a lot longer than I have.

@ergawy marked this pull request as draft on January 11, 2024
@klausler (Contributor) commented

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

Thanks for your reply. Yup, I am looking into understanding the inherent issues in do concurrent at the moment. I spent some time going over the discussions on GitHub and the J3 forums (not all involving you directly, but relevant to the issues you raised) ([1], [2], and the linked papers and documents), trying to understand the issues in more depth. Once I have formed some more background, I will reach out to you, if you don't mind, since you have thought about this problem a lot longer than I have.

It is not complicated.

There are two common assumptions about DO CONCURRENT that are not true. First, due to the name, many people assume that DO CONCURRENT is an assertion by the programmer that the loop is safe to parallelize and/or a request to parallelize the loop. Second, that the restrictions imposed by the standard on DO CONCURRENT loops make any conformant program safe to parallelize.

The actual definition of DO CONCURRENT is such that it is safe to execute the iterations of the loop in any serial order. Its restrictions are not sufficient to make the iterations safe to execute in parallel. You can write a DO CONCURRENT loop that is perfectly conformant with the standard but still unsafe to parallelize.

The standard could be fixed with a small change, but the committee won't do it.

If you have permission from the user via the command line to parallelize, fine. But by default, you still need to analyze the data flow in the loop to guarantee safe parallelization.
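A minimal illustration of that point (an example sketch, not taken from the thread): the loop below is conformant because t is always defined before it is referenced within each iteration, so any serial iteration order gives the same result; yet a naive parallel mapping races on t unless the compiler first proves that t must be privatized:

    do concurrent (i = 1:n)
      t = a(i) * b(i)   ! t is defined, then referenced, in every iteration
      c(i) = t + d(i)   ! fine in any serial order; a data race in parallel
    end do              ! unless t is privatized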

Adds a pass to map `do concurrent` to OpenMP worksharing constructs. For
now, only maps basic loops to `omp parallel do`. This is still a WIP
with more work needed for testing and mapping more advanced loops.
This does not detail everything yet; still working through this.
@ergawy force-pushed the do_concurrent_to_omp branch from 4be76ec to 07d6c48 on February 16, 2024
@jeanPerier (Contributor) left a comment

Hi @ergawy, a very late review on my side. Thanks for working on this; it is a great feature. As long as this pass is run behind a flag (given the points made by Peter), the direction looks good to me. (I appreciate that this is done in a pass, so that do concurrent can also easily be mapped to other concepts.)

let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";

let description = [{ This is an experimental pass to map `DO CONCURRENR` loops
to their correspnding equivalent OpenMP worksharing constructs.
@jeanPerier (Contributor) suggested a change:
-to their correspnding equivalent OpenMP worksharing constructs.
+to their corresponding equivalent OpenMP worksharing constructs.

@ergawy (Member, Author) commented Feb 6, 2025

Sorry for the delay here. Since this was not the highest-priority item on our roadmap, we have been working on it on the side over the past year on the ROCm fork. I will begin upstreaming our downstream implementation, starting with #126026. The pass will be hidden behind a CL flag, and I tracked the data-dependence analysis (among other things) in a document that explains the current status and future steps.

@ergawy closed this on Feb 6, 2025
ergawy added a commit to ergawy/llvm-project that referenced this pull request Feb 6, 2025
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces llvm#77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
ergawy added a commit that referenced this pull request Apr 2, 2025
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achieved
performance speed-ups that match pure OpenMP; for the GPU mapping, we are
still working on extending our support for implicit memory mapping and
locality specifiers.

PR stack:
- #126026 (this PR)
- #127595
- #127633
- #127634
- #127635