Loop-pipelining invalid hoisting out #90870

Closed
@fotiskoun

Description

Loop pipelining may hoist operations out of the loop without checking that doing so is legal, i.e. that the hoisted operations stay within the original loop bounds. This is incorrect, and it can lead to unexpected behaviour such as out-of-bounds accesses when the accessed structure has fewer elements than the number of hoisted iterations.

Consider the following example:

func.func @f(%arg0: memref<4x16xf32>, %arg1: vector<16xf32>) -> vector<16xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c2 = arith.constant 2 : index
  %0 = scf.for %arg2 = %c0 to %c2 step %c1 iter_args(%arg3 = %arg1) -> (vector<16xf32>) {
    %1 = vector.transfer_read %arg0[%arg2, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %2 = arith.addf %1, %arg3 : vector<16xf32>
    scf.yield %2 : vector<16xf32>
  }
  return %0 : vector<16xf32>
}
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["scf.for"]} in %arg1 : (!transform.any_op) -> !transform.op<"scf.for">
    %1 = transform.loop.pipeline %0 {iteration_interval = 1 : i64, read_latency = 5 : i64,  scheduling_type = "full-loops"} : (!transform.op<"scf.for">) -> !transform.any_op
    transform.yield
  }
}

The output for this example is:

module {
  func.func @f(%arg0: memref<4x16xf32>, %arg1: vector<16xf32>) -> vector<16xf32> {
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %c4 = arith.constant 4 : index
    %c3 = arith.constant 3 : index
    %c2 = arith.constant 2 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %1 = vector.transfer_read %arg0[%c1, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %2 = vector.transfer_read %arg0[%c2, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %3 = vector.transfer_read %arg0[%c3, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %4 = vector.transfer_read %arg0[%c4, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %5 = vector.transfer_read %arg0[%c1, %c0], %cst {in_bounds = [true]} : memref<4x16xf32>, vector<16xf32>
    %6 = arith.addf %0, %arg1 : vector<16xf32>
    %7 = arith.addf %1, %6 : vector<16xf32>
    %8 = arith.addf %2, %7 : vector<16xf32>
    %9 = arith.addf %3, %8 : vector<16xf32>
    %10 = arith.addf %4, %9 : vector<16xf32>
    %11 = arith.addf %5, %10 : vector<16xf32>
    return %11 : vector<16xf32>
  }
  module attributes {transform.with_named_sequence} {
    transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
      %0 = transform.structured.match ops{["scf.for"]} in %arg0 : (!transform.any_op) -> !transform.op<"scf.for">
      %1 = transform.loop.pipeline %0 {read_latency = 5 : i64} : (!transform.op<"scf.for">) -> !transform.any_op
      transform.yield
    }
  }
}

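The result is incorrect: the original loop runs for two iterations and reads only %arg0[0] and %arg0[1], but after pipelining and unrolling the output contains six reads that go up to %arg0[4]. Row 4 is out of bounds for memref<4x16xf32>, and the result accumulates six vectors instead of two.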
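To make the mismatch concrete, here is a small Python model of the row indices involved. It is only an illustration: the helper names are hypothetical, and the list of pipelined rows is copied by hand from the first index of each vector.transfer_read in the output above.

```python
def rows_read_by_loop(lb, ub, step):
    """Row indices visited by `scf.for %i = lb to ub step step`."""
    return list(range(lb, ub, step))


def out_of_bounds_rows(rows, num_rows):
    """Rows that fall outside a memref with `num_rows` rows."""
    return [r for r in rows if not 0 <= r < num_rows]


NUM_ROWS = 4  # first dimension of memref<4x16xf32>

# Original loop: %c0 to %c2 step %c1.
original_rows = rows_read_by_loop(0, 2, 1)

# First index of the six transfer_read ops in the reported output
# (%c0, %c1, %c2, %c3, %c4, %c1), copied by hand.
pipelined_rows = [0, 1, 2, 3, 4, 1]

print("original:", original_rows)
print("pipelined:", pipelined_rows)
print("out of bounds after pipelining:",
      out_of_bounds_rows(pipelined_rows, NUM_ROWS))
```

The original loop stays inside the memref, while the pipelined output both reads rows the loop never touched and reads one row past the end.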

The proposed solution checks the loop bounds and limits the unrolling to the minimum of what the loop bounds allow and the provided unrolling argument.
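The clamping idea can be sketched in a few lines of Python. This is not the actual patch or any MLIR API: `trip_count`, `clamped_unroll` and the `requested` parameter are hypothetical names, and the sketch assumes constant loop bounds and a positive step.

```python
def trip_count(lb, ub, step):
    """Number of iterations of `for i = lb to ub step step` (step > 0)."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Ceiling division of (ub - lb) by step, clamped at zero for empty loops.
    return max(0, -((lb - ub) // step))


def clamped_unroll(lb, ub, step, requested):
    """Never unroll or hoist more iterations than the loop actually runs."""
    return min(trip_count(lb, ub, step), requested)


# Loop from the example: 0 to 2 step 1, with a hypothetical request of 5.
print(clamped_unroll(0, 2, 1, 5))
```

With the bounds from the example, the clamp keeps the number of hoisted iterations at the loop's two iterations, so no read would go past %arg0[1]. When the bounds are not compile-time constants, a check like this cannot be evaluated statically and a more conservative choice would be needed.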
