[WIP][flang][OpenMP] Experimental pass to map do concurrent to OMP #77285


Closed
wants to merge 2 commits

Conversation

@ergawy (Member) commented Jan 8, 2024

Adds a pass to map do concurrent to OpenMP worksharing constructs. For now, it only maps basic loops to omp parallel do. This is still a WIP; more work is needed for testing and for mapping more advanced loops.

So far, the pass is not added to any flang pipelines; it lives only as a standalone pass tested via fir-opt. Once we agree on the approach and the pass matures, I will add it to flang.
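As an illustrative sketch (not part of the patch itself), the mapping treats a basic loop such as:

    do concurrent (i = 1:n)
      a(i) = i
    end do

as if the user had written:

    !$omp parallel do
    do i = 1, n
      a(i) = i
    end do
    !$omp end parallel do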

@ergawy self-assigned this on Jan 8, 2024
@llvmbot added the flang (Flang issues not falling into any other category) and flang:fir-hlfir labels on Jan 8, 2024
@ergawy requested a review from tblah on January 8, 2024
@llvmbot (Member) commented Jan 8, 2024

@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes

Adds a pass to map do concurrent to OpenMP worksharing constructs. For now, it only maps basic loops to omp parallel do. This is still a WIP; more work is needed for testing and for mapping more advanced loops.


Full diff: https://github.com/llvm/llvm-project/pull/77285.diff

6 Files Affected:

  • (modified) flang/include/flang/Optimizer/HLFIR/HLFIROps.td (+1-1)
  • (modified) flang/include/flang/Optimizer/Transforms/Passes.h (+2)
  • (modified) flang/include/flang/Optimizer/Transforms/Passes.td (+20)
  • (modified) flang/lib/Optimizer/Transforms/CMakeLists.txt (+1)
  • (added) flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp (+162)
  • (added) flang/test/Transforms/DoConcurrent/basic.mlir (+60)
diff --git a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
index 254771ca1780cf..718026e4fdfbe7 100644
--- a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
+++ b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
@@ -75,7 +75,7 @@ def hlfir_DeclareOp : hlfir_Op<"declare", [AttrSizedOperandSegments,
     func.func @foo(%arg0: !fir.ref<!fir.array<?x?x!fir.char<1,?>>>, %arg1: !fir.ref<i64>) {
       %c10 = arith.constant 10 : index
       %c20 = arith.constant 20 : index
-      %1 = fir.load %ag1 : fir.ref<i64>
+      %1 = fir.load %arg1 : fir.ref<i64>
       %2 = fir.shape_shift %c10, %1, %c20, %1 : (index, index, index, index) -> !fir.shapeshift<2>
       %3 = hfir.declare %arg0(%2) typeparams %1 {uniq_name = "c"} (fir.ref<!fir.array<?x?x!fir.char<1,?>>>, fir.shapeshift<2>, index) -> (fir.box<!fir.array<?x?x!fir.char<1,?>>>, fir.ref<!fir.array<?x?x!fir.char<1,?>>>)
       // ... uses %3#0 as "c"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.h b/flang/include/flang/Optimizer/Transforms/Passes.h
index 6970da8698ae84..1f389a1f571b75 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.h
+++ b/flang/include/flang/Optimizer/Transforms/Passes.h
@@ -93,6 +93,8 @@ std::unique_ptr<mlir::Pass> createFunctionAttrPass();
 std::unique_ptr<mlir::Pass>
 createFunctionAttrPass(FunctionAttrTypes &functionAttr);
 
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass();
+
 // declarative passes
 #define GEN_PASS_REGISTRATION
 #include "flang/Optimizer/Transforms/Passes.h.inc"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.td b/flang/include/flang/Optimizer/Transforms/Passes.td
index e3c45d41f04cc7..41d9be4249b5e2 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.td
+++ b/flang/include/flang/Optimizer/Transforms/Passes.td
@@ -370,4 +370,24 @@ def FunctionAttr : Pass<"function-attr", "mlir::func::FuncOp"> {
   let constructor = "::fir::createFunctionAttrPass()";
 }
 
+def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENR` loops
+     to their correspnding equivalent OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+       - Mapping to `target teams distribute parallel do`.
+       - Allowing the user to control mapping behavior: either to the host or
+         target.
+  }];
+
+  let constructor = "::fir::createDoConcurrentConversionPass()";
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+}
+
 #endif // FLANG_OPTIMIZER_TRANSFORMS_PASSES
diff --git a/flang/lib/Optimizer/Transforms/CMakeLists.txt b/flang/lib/Optimizer/Transforms/CMakeLists.txt
index fc067ad3585395..13a46ef7422530 100644
--- a/flang/lib/Optimizer/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/Transforms/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FIRTransforms
   OMPMarkDeclareTarget.cpp
   VScaleAttr.cpp
   FunctionAttr.cpp
+  DoConcurrentConversion.cpp
 
   DEPENDS
   FIRDialect
diff --git a/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
new file mode 100644
index 00000000000000..180c0bdf672af9
--- /dev/null
+++ b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
@@ -0,0 +1,162 @@
+//===- DoConcurrentConversion.cpp -- map `DO CONCURRENT` to OpenMP loops --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Optimizer/Dialect/FIRDialect.h"
+#include "flang/Optimizer/Dialect/FIROps.h"
+#include "flang/Optimizer/Dialect/FIRType.h"
+#include "flang/Optimizer/Dialect/Support/FIRContext.h"
+#include "flang/Optimizer/HLFIR/HLFIRDialect.h"
+#include "flang/Optimizer/Transforms/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/Diagnostics.h"
+#include "mlir/IR/IRMapping.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include <memory>
+
+namespace fir {
+#define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
+#include "flang/Optimizer/Transforms/Passes.h.inc"
+} // namespace fir
+
+#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+
+namespace {
+class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
+public:
+  using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
+
+  mlir::LogicalResult
+  matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
+                  mlir::ConversionPatternRewriter &rewriter) const override {
+    mlir::OpPrintingFlags flags;
+    flags.printGenericOpForm();
+
+    mlir::omp::ParallelOp parallelOp =
+        rewriter.create<mlir::omp::ParallelOp>(doLoop.getLoc());
+
+    rewriter.createBlock(&parallelOp.getRegion());
+    mlir::Block &block = parallelOp.getRegion().back();
+
+    rewriter.setInsertionPointToEnd(&block);
+    rewriter.create<mlir::omp::TerminatorOp>(doLoop.getLoc());
+
+    rewriter.setInsertionPointToStart(&block);
+
+    // Clone the LB, UB, step defining ops inside the parallel region.
+    llvm::SmallVector<mlir::Value> lowerBound, upperBound, step;
+    lowerBound.push_back(
+        rewriter.clone(*doLoop.getLowerBound().getDefiningOp())->getResult(0));
+    upperBound.push_back(
+        rewriter.clone(*doLoop.getUpperBound().getDefiningOp())->getResult(0));
+    step.push_back(
+        rewriter.clone(*doLoop.getStep().getDefiningOp())->getResult(0));
+
+    auto wsLoopOp = rewriter.create<mlir::omp::WsLoopOp>(
+        doLoop.getLoc(), lowerBound, upperBound, step);
+    wsLoopOp.setInclusive(true);
+
+    auto outlineableOp =
+        mlir::dyn_cast<mlir::omp::OutlineableOpenMPOpInterface>(*parallelOp);
+    assert(outlineableOp);
+    rewriter.setInsertionPointToStart(outlineableOp.getAllocaBlock());
+
+    // For the induction variable, we need to privatize its allocation and
+    // binding inside the parallel region.
+    llvm::SmallSetVector<mlir::Operation *, 2> workList;
+    // Therefore, we first find the induction variable by looking for
+    // `fir.store`s whose source is the loop's block argument.
+    workList.insert(doLoop.getInductionVar().getUsers().begin(),
+                    doLoop.getInductionVar().getUsers().end());
+    llvm::SmallSetVector<fir::StoreOp, 2> inductionVarTargetStores;
+
+    // Walk the use-chain of the loop's block argument until we hit a `fir.store`.
+    while (!workList.empty()) {
+      mlir::Operation *item = workList.front();
+
+      if (auto storeOp = mlir::dyn_cast<fir::StoreOp>(item)) {
+        inductionVarTargetStores.insert(storeOp);
+      } else {
+        workList.insert(item->getUsers().begin(), item->getUsers().end());
+      }
+
+      workList.remove(item);
+    }
+
+    // For each collected `fir.store`, find the target memref's alloca and
+    // declare ops.
+    llvm::SmallSetVector<mlir::Operation *, 4> declareAndAllocasToClone;
+    for (auto storeOp : inductionVarTargetStores) {
+      mlir::Operation *storeTarget = storeOp.getMemref().getDefiningOp();
+
+      for (auto operand : storeTarget->getOperands()) {
+        declareAndAllocasToClone.insert(operand.getDefiningOp());
+      }
+      declareAndAllocasToClone.insert(storeTarget);
+    }
+
+    mlir::IRMapping mapper;
+
+    // Collect the memref defining ops in the parallel region.
+    for (mlir::Operation *opToClone : declareAndAllocasToClone) {
+      rewriter.clone(*opToClone, mapper);
+    }
+
+    // Clone the loop's body inside the worksharing construct using the mapped
+    // memref values.
+    rewriter.cloneRegionBefore(doLoop.getRegion(), wsLoopOp.getRegion(),
+                               wsLoopOp.getRegion().begin(), mapper);
+
+    mlir::Operation *terminator = wsLoopOp.getRegion().back().getTerminator();
+    rewriter.setInsertionPointToEnd(&wsLoopOp.getRegion().back());
+    rewriter.create<mlir::omp::YieldOp>(terminator->getLoc());
+    rewriter.eraseOp(terminator);
+
+    rewriter.eraseOp(doLoop);
+
+    return mlir::success();
+  }
+};
+
+class DoConcurrentConversionPass
+    : public fir::impl::DoConcurrentConversionPassBase<
+          DoConcurrentConversionPass> {
+public:
+  void runOnOperation() override {
+    mlir::func::FuncOp func = getOperation();
+
+    if (func.isDeclaration()) {
+      return;
+    }
+
+    auto *context = &getContext();
+    mlir::RewritePatternSet patterns(context);
+    patterns.insert<DoConcurrentConversion>(context);
+    mlir::ConversionTarget target(*context);
+    target.addLegalDialect<fir::FIROpsDialect, hlfir::hlfirDialect,
+                           mlir::arith::ArithDialect, mlir::func::FuncDialect,
+                           mlir::omp::OpenMPDialect>();
+
+    target.addDynamicallyLegalOp<fir::DoLoopOp>(
+        [](fir::DoLoopOp op) { return !op.getUnordered(); });
+
+    if (mlir::failed(mlir::applyFullConversion(getOperation(), target,
+                                               std::move(patterns)))) {
+      mlir::emitError(mlir::UnknownLoc::get(context),
+                      "error in converting do-concurrent op");
+      signalPassFailure();
+    }
+  }
+};
+} // namespace
+
+std::unique_ptr<mlir::Pass> fir::createDoConcurrentConversionPass() {
+  return std::make_unique<DoConcurrentConversionPass>();
+}
diff --git a/flang/test/Transforms/DoConcurrent/basic.mlir b/flang/test/Transforms/DoConcurrent/basic.mlir
new file mode 100644
index 00000000000000..b4a2ced45950e5
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic.mlir
@@ -0,0 +1,60 @@
+// Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+// RUN: fir-opt --fopenmp-do-concurrent-conversion %s | FileCheck %s
+
+// CHECK-LABEL: func.func @do_concurrent_basic
+func.func @do_concurrent_basic() attributes {fir.bindc_name = "do_concurrent_basic"} {
+    // CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+    // CHECK: %[[C1:.*]] = arith.constant 1 : i32
+    // CHECK: %[[C10:.*]] = arith.constant 10 : i32
+
+    %0 = fir.alloca i32 {bindc_name = "i"}
+    %1:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+    %2 = fir.address_of(@_QFEa) : !fir.ref<!fir.array<10xi32>>
+    %c10 = arith.constant 10 : index
+    %3 = fir.shape %c10 : (index) -> !fir.shape<1>
+    %4:2 = hlfir.declare %2(%3) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+    %c1_i32 = arith.constant 1 : i32
+    %7 = fir.convert %c1_i32 : (i32) -> index
+    %c10_i32 = arith.constant 10 : i32
+    %8 = fir.convert %c10_i32 : (i32) -> index
+    %c1 = arith.constant 1 : index
+
+    // CHECK-NOT: fir.do_loop
+
+    // CHECK: omp.parallel {
+
+    // CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+    // CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+    // CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+    // CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+    // CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+     // CHECK: omp.wsloop for (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+     // CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+     // CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+     // CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+     // CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]])  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+     // CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+     // CHECK-NEXT: omp.yield
+     // CHECK-NEXT: }
+
+     // CHECK-NEXT: omp.terminator
+     // CHECK-NEXT: }
+    fir.do_loop %arg0 = %7 to %8 step %c1 unordered {
+      %13 = fir.convert %arg0 : (index) -> i32
+      fir.store %13 to %1#1 : !fir.ref<i32>
+      %14 = fir.load %1#0 : !fir.ref<i32>
+      %15 = fir.load %1#0 : !fir.ref<i32>
+      %16 = fir.convert %15 : (i32) -> i64
+      %17 = hlfir.designate %4#0 (%16)  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+      hlfir.assign %14 to %17 : i32, !fir.ref<i32>
+    }
+
+    // CHECK-NOT: fir.do_loop
+
+    return
+  }

@kiranchandramohan (Contributor) commented

Nice start.

In case you have missed it, there are some recommendations for do concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md.

I guess the things to consider are:

  1. Is any additional analysis necessary before this conversion?
  2. How to map locality constraints? I believe they are currently handled in lowering; this will need to be delayed.
  3. Ensure this is only enabled with driver flags.

@ergawy (Member, Author) commented Jan 9, 2024

Thanks @kiranchandramohan for looking at this!

Nice start.

In case you have missed it, there are some recommendations for do concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md.

I guess the things to consider are:

1. Is any additional analysis necessary before this conversion?

I also have the feeling that some analyses will be needed, but I am not sure which ones yet. Maybe a starting point would be to verify that input loops respect the constraints listed by the spec, for example C1121-C1137. Some of these constraints are already enforced by the front-end before lowering, so we will have to consider them one by one and decide the best place to check each. There is also section 11.1.7.5 "Additional semantics for DO CONCURRENT constructs" in the spec, which I think can be investigated for further analyses before we apply the mapping to OMP. Do you have any particular analyses in mind beyond this?
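For example, here is a sketch of one restriction in that range (paraphrased from F2018; the exact constraint number is worth double-checking against the spec):

    do concurrent (i = 1:n)
      if (a(i) < 0) return   ! not conformant: RETURN (like EXIT or a branch
    end do                   ! out of the construct) is prohibited here

The front-end already rejects violations like this one, which is why each constraint needs to be considered individually.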

2. How to map locality constraints? I believe they are currently handled in lowering; this will need to be delayed.

Indeed this is currently handled in lowering. Below is an example without local_init and then the same loop with local_init:

Without local_init this is the loop directly after lowering:
    // do concurrent (integer :: j=i:10)
    // end do
    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
      %10 = fir.convert %arg0 : (index) -> i32
      fir.store %10 to %1#1 : !fir.ref<i32>
    }
With local_init this is the loop directly after lowering:
    //  do concurrent (integer :: j=i:10) local_init(i)
    //  end do
    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
      %10 = fir.convert %arg0 : (index) -> i32
      fir.store %10 to %1#1 : !fir.ref<i32>
      %11 = fir.alloca i32 {bindc_name = "i", pinned, uniq_name = "_QFEi"}
      %12:2 = hlfir.declare %11 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
      %13 = fir.load %6#0 : !fir.ref<i32>
      hlfir.assign %13 to %12#0 : i32, !fir.ref<i32>
    }
So, one possibility I think would be:
  • Extract the omp.map_info op to some shared location between the fir and omp dialects.
  • Use map_info to model locality constraints for do concurrent as well.

Do you see any blockers for such an approach?
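For reference, a sketch of the locality specifiers such a mapping would eventually need to model (illustrative only; tmp, scale, a, and n are assumed declarations), with their closest OpenMP analogues noted in the comments:

    do concurrent (i = 1:n) local(tmp) local_init(scale) shared(a)
      tmp = a(i) * scale   ! local(tmp): undefined on entry, ~ OpenMP private
      a(i) = tmp           ! local_init(scale): copied in, ~ OpenMP firstprivate
    end do                 ! shared(a): one instance, ~ OpenMP shared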

3. Ensure this is only enabled with driver flags.

Will do! For now, I suggest keeping it as a standalone pass until we flesh it out a bit more. Please let me know if you disagree.


In addition to what you mentioned, I think point number 4 would be how to model reductions. That is for the future, though, after the above points are addressed.


With that in mind, do you mind if we review and merge the current PR as is and defer further development to later PRs? I think doing all of this in one PR would not be ideal.

@kiranchandramohan (Contributor) commented

Thanks @ergawy for the reply.

Since there are some strong opinions expressed about Do Concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md, and since there are several possible lowerings (sequential, CPU-OpenMP, GPU-OpenMP, CPU-SIMD, GPU-OpenACC), it might be good to agree on the design first and then proceed. The way to do this would be to write a design doc, submit a PR to the flang/docs directory, and advertise it on the Flang Slack and Discourse. In particular, we should be clear about the analysis required. I have not read the relevant sections of the standard closely enough to reply to your question above; doesn't the document cover this?

@ergawy (Member, Author) commented Jan 9, 2024

Thanks @ergawy for the reply.

Since there are some strong opinions expressed about Do Concurrent in https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md, and since there are several possible lowerings (sequential, CPU-OpenMP, GPU-OpenMP, CPU-SIMD, GPU-OpenACC), it might be good to agree on the design first and then proceed. The way to do this would be to write a design doc, submit a PR to the flang/docs directory, and advertise it on the Flang Slack and Discourse. In particular, we should be clear about the analysis required. I have not read the relevant sections of the standard closely enough to reply to your question above; doesn't the document cover this?

Sure thing! I will start working on a design doc and open a separate (parent) PR for it.

@ergawy requested a review from klausler on January 10, 2024
@klausler (Contributor) commented

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

@ergawy (Member, Author) commented Jan 11, 2024

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

Thanks for your reply. Yup, I am looking into understanding the inherent issues in do concurrent at the moment. I spent some time going over the discussions on GitHub and the J3 forums (not all involving you directly, but relevant to the issues you raised) ([1], [2], and the linked papers and documents), trying to understand the issues in more depth. Once I have formed some more background, I will reach out to you, if you don't mind, since you have thought about this problem a lot longer than I have.

@ergawy marked this pull request as draft on January 11, 2024
@klausler (Contributor) commented

Do not run DO CONCURRENT in parallel by default without ensuring first that the loop is free of non-parallelizable data usage.

Thanks for your reply. Yup, I am looking into understanding the inherent issues in do concurrent at the moment. I spent some time going over the discussions on GitHub and the J3 forums (not all involving you directly, but relevant to the issues you raised) ([1], [2], and the linked papers and documents), trying to understand the issues in more depth. Once I have formed some more background, I will reach out to you, if you don't mind, since you have thought about this problem a lot longer than I have.

It is not complicated.

There are two common assumptions about DO CONCURRENT that are not true. First, due to the name, many people assume that DO CONCURRENT is an assertion by the programmer that the loop is safe to parallelize and/or a request to parallelize the loop. Second, that the restrictions imposed by the standard on DO CONCURRENT loops make any conformant program safe to parallelize.

The actual definition of DO CONCURRENT is such that it is safe to execute the iterations of the loop in any serial order. Its restrictions are not sufficient to make the iterations safe to execute in parallel. You can write a DO CONCURRENT loop that is perfectly conformant with the standard but still unsafe to parallelize.

The standard could be fixed with a small change, but the committee won't do it.

If you have permission from the user via the command line to parallelize, fine. But by default, you still need to analyze the data flow in the loop to guarantee safe parallelization.
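A minimal illustration of that point (an example sketch, not taken from the thread): the loop below is conformant because t is always defined before it is referenced within each iteration, so any serial iteration order gives the same result; yet a naive parallel mapping races on t unless the compiler first proves that t must be privatized:

    do concurrent (i = 1:n)
      t = a(i) * b(i)   ! t is defined, then referenced, in every iteration
      c(i) = t + d(i)   ! fine in any serial order; a data race in parallel
    end do              ! unless t is privatized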

Adds a pass to map `do concurrent` to OpenMP worksharing constructs. For
now, only maps basic loops to `omp parallel do`. This is still a WIP
with more work needed for testing and mapping more advanced loops.
This does not detail everything yet; still working through this.
@ergawy force-pushed the do_concurrent_to_omp branch from 4be76ec to 07d6c48 on February 16, 2024
@jeanPerier (Contributor) left a comment

Hi @ergawy, a very late review on my side. Thanks for working on this; it is a great feature. As long as this pass is run behind a flag (given the points made by Peter), the direction looks good to me. (I appreciate that this is done in a pass, so that do concurrent can also easily be mapped to other concepts.)

let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";

let description = [{ This is an experimental pass to map `DO CONCURRENR` loops
to their correspnding equivalent OpenMP worksharing constructs.
@jeanPerier (Contributor) suggested a change:
-to their correspnding equivalent OpenMP worksharing constructs.
+to their corresponding equivalent OpenMP worksharing constructs.

@ergawy (Member, Author) commented Feb 6, 2025

Sorry for the delay here. Since this was not the highest-priority item on our roadmap, we have been working on it on the side over the past year on the ROCm fork. I will begin upstreaming our downstream implementation, starting with #126026. The pass will be hidden behind a CL flag, and I tracked the data-dependence analysis (among other things) in a document that explains the current status and future steps.

@ergawy closed this on Feb 6, 2025
ergawy added a commit to ergawy/llvm-project that referenced this pull request Feb 6, 2025
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces llvm#77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
ergawy added a commit that referenced this pull request Apr 2, 2025
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achieved
performance speed-ups that match pure OpenMP; for the GPU mapping, we are
still working on extending our support for implicit memory mapping and
locality specifiers.

PR stack:
- #126026 (this PR)
- #127595
- #127633
- #127634
- #127635