[mlir][arith] Add more canonicalization and integration tests coverage #92272
Conversation
@llvm/pr-subscribers-mlir-arith

Author: Jacob Yu (pingshiyu)

Changes: I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations. It found a bunch of bugs in the arith folders, optimisations and lowering. However, during the development process, I ran into a lot more bugs within my own interpreter, which resulted in a fair number of regression tests for the interpreter. I thought it might be a good idea to add these to the MLIR codebase - they could guard against potential bugs that may be introduced in the future, for example if someone made the same mistakes as I did :p. The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the Integration folder; however, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code). I've added a few ways these can be adapted in the files of the initial commit, which includes just a few of the unit tests. Please do let me know if adding these would be helpful, and what's the preferred format for them. Once/if that's all agreed upon I can add the rest of the regression suite!

Full diff: https://github.com/llvm/llvm-project/pull/92272.diff

2 Files Affected:
diff --git a/mlir/test/Dialect/Arith/canonicalize.mlir b/mlir/test/Dialect/Arith/canonicalize.mlir
index e4f95bb0545a2..aa79ad1bead56 100644
--- a/mlir/test/Dialect/Arith/canonicalize.mlir
+++ b/mlir/test/Dialect/Arith/canonicalize.mlir
@@ -3022,6 +3022,36 @@ func.func @minsi_i0() -> i0 {
return %minsi : i0
}
+// CHECK-LABEL: @extsiOnI1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant -1 : i16
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extsiOnI1() -> (i1, i16) {
+ %true = arith.constant -1 : i1
+ %0 = arith.extsi %true : i1 to i16
+ return %true, %0 : i1, i16
+}
+
+// CHECK-LABEL: @extuiOn1I1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant 1 : i64
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extuiOn1I1() -> (i1, i64) {
+ %true = arith.constant true
+ %0 = arith.extui %true : i1 to i64
+ return %true, %0 : i1, i64
+}
+
+// CHECK-LABEL: @trunciI16ToI8
+// CHECK: %[[CST:.*]] = arith.constant 20194 : i16
+// CHECK: %[[CST2:.*]] = arith.constant -30 : i8
+// CHECK: return %[[CST]], %[[CST2]]
+func.func @trunciI16ToI8() -> (i16, i8) {
+ %c20194_i16 = arith.constant 20194 : i16
+ %0 = arith.trunci %c20194_i16 : i16 to i8
+ return %c20194_i16, %0 : i16, i8
+}
+
// CHECK-LABEL: @mulsi_extended_i0
// CHECK: %[[ZERO:.*]] = arith.constant 0 : i0
// CHECK: return %[[ZERO]], %[[ZERO]] : i0
diff --git a/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
new file mode 100644
index 0000000000000..cd1cad4d56da9
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
@@ -0,0 +1,52 @@
+// Regression test suite from an independent development of Arith's
+// semantics and a fuzzer
+
+// RUN: mlir-opt %s --convert-scf-to-cf --convert-cf-to-llvm --convert-vector-to-llvm \
+// RUN: --convert-func-to-llvm --convert-arith-to-llvm | \
+// RUN: mlir-cpu-runner -e entry -entry-point-result=void \
+// RUN: --shared-libs=%mlir_c_runner_utils | \
+// RUN: FileCheck %s --match-full-lines
+
+module {
+ func.func @extsiOnI1() {
+ %true = arith.constant -1 : i1
+ %0 = arith.extsi %true : i1 to i16
+ vector.print %true : i1
+ vector.print %0 : i16
+ return
+ }
+
+ func.func @extuiOn1I1() {
+ %true = arith.constant true
+ %0 = arith.extui %true : i1 to i64
+ vector.print %true : i1
+ vector.print %0 : i64
+ return
+ }
+
+ func.func @trunciI16ToI8() {
+ %c20194_i16 = arith.constant 20194 : i16
+ %0 = arith.trunci %c20194_i16 : i16 to i8
+ vector.print %c20194_i16 : i16
+ vector.print %0 : i8
+ return
+ }
+
+ func.func @entry() {
+
+ // CHECK: 1
+ // CHECK: -1
+ func.call @extsiOnI1() : () -> ()
+
+ // CHECK: 1
+ // CHECK: 1
+ func.call @extuiOn1I1() : () -> ()
+
+ // CHECK: 20194
+ // CHECK: -30
+ func.call @trunciI16ToI8() : () -> ()
+
+
+ return
+ }
+}
\ No newline at end of file
Nice!
Quick question, what kind of bugs did you find in the canonicalization patterns and the lowerings?
Do you have issues/PR for them already?
Also, is your fuzzer/semantics open-source somewhere? It would be interesting for me to see them, to compare them with the one I currently have for arith (https://github.com/opencompl/xdsl-smt/blob/main/xdsl_smt/semantics/arith_semantics.py)
Co-authored-by: Fehr Mathieu <mathieu.fehr@gmail.com>
Yep! The bugs found by this method are here (some are, as of now, unfixed! :o). The fuzzer/semantics are unfortunately not open source at the moment, but I do plan on making them so in the near future! There's a bit more I plan to work on before it's public :)
I want to ask: what is the preferred format for these tests? Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so, I'll add the rest of the tests into the PR too.
mlir/test/Integration/Dialect/Arith/CPU/test-int-trunc-ext.mlir
Historically speaking, we are not relying too much on integration tests. I'm not against them in principle though, as long as they are tiny (fast to execute) and easy to maintain. One important aspect, though, is that our test coverage does not rely on the integration tests: all of our canonicalizations should be unit-tested.
Thanks for sharing these, @pingshiyu! I'm always in favour of improving our test coverage, but if we intend to add much more then it would be good to have some plan. Another question that I'd ask - how far do we go with testing? Do we want to exercise every possible combination? For example, looking at "test-int-trunc-ext.mlir" that you added, there are the following 3 cases:
Why not more, i.e. other data types and operations? Once we consider all the possibilities, the potential set of all tests becomes really large. This can be a maintenance burden. Are we worried? Also, if we are to add many more tests, it would be good to establish some structure so that it's easy to identify cases that are covered and to find cases that are missing. Otherwise, we tend to duplicate tests; I've seen examples in the Vector dialect, not sure about Arith. This can usually be improved with comments and consistent test function naming.
I find integration tests very helpful, but they tend to be expensive and hence hidden behind a CMake flag. Improving the coverage for Arith would be a welcome addition, but it would be good to have a strategy for selecting good test cases. To me, this is a very good reference: IIUC, there's a case for every operation, and every Op is verified with a set of inputs. I know that @Lewuathe has recently contributed there - would you have any recommendation for good cases to test?
Definitely a good point on having the new tests organised in a way that makes it easy to identify gaps! The current tests don't cover all cases, although I think it wouldn't be realistic to do so either - that would be a combinatorial explosion and is perhaps better explored using property-based testing :) For this PR, my aim is to just add all the tests that I have accrued from the interpreter exercise, and most of these were once a bug within the interpreter. In the most recent commit I've added the remainder of the tests as integration tests. I avoided the
I've assigned the tests a rough category for what they're testing, according to the file names added :)
Some good points here! IMO this specific PR falls under the broader umbrella of regression tests -- I think there's a lot of value in handling the corner cases we did not handle properly in the past. We should definitely have LIT test coverage there. In LLVM it used to be much more common to check in tests taken verbatim from bugzilla in individual regression test files. I probably wouldn't go as far, and having the regression tests organized in some logical buckets makes sense to me. Similarly, exhaustively testing every permutation of types / operations is unlikely to add much value on its own, but I wouldn't mind adding more tests if we do discover that our coverage is insufficient and start seeing bugs being reported. As for the integration tests, I think MLIR is really undertested. There's no in-tree frontend compiler (except for flang?) to truly exercise the conversions on real hardware, and I think there's also value in having integration tests for conversions that may be tricky to get right (e.g., wide integer emulation) or easy to miscompile by the backend compiler. I think these extension tests fit in nicely here.
Thank you all for replying to my comments!

1. Wider discussion on testing

I would prefer to avoid that. Instead, for every bug I'd rather identify the actual edge case that's:

In general, I don't quite follow the split into "regression" and regular tests (I also feel that folks interpret the split differently). From https://llvm.org/docs/TestingGuide.html#regression-tests

To me that sounds like a test for a specific (edge) case that "could go wrong". For completeness, that same document defines "unit test" as:

But let's not diverge too much - we all agree that we need more testing and patches like this one 😅 Btw ...

We do have e2e tests for Linalg (and Vector and SparseTensor) that target SVE and SME. These tests are run by LLVM buildbots (there are more tests for other targets too, but I don't track those as closely). Glad to hear that people find these valuable :)

2. This patch

Test files

Now, looking at the files added in this PR:

I see two conflicting splits:

For consistency, could you select one and stick with that? My personal preference would be one file per operation (or group of similar operations), and then exercise the "most interesting" data types in every file. Also, IMHO you can safely skip "test" in file names that are located in ... a "test" sub-directory (to reduce noise). That's a nit ;-)

Review process - multiple PRs

Given that you are testing a lot of edge cases, I'd like to go over every one of them and make sure they agree with my expectations. I am raising this especially in the context of the following PR, in which we discovered that our folding logic for

As in, I think it's good to review every case for such tests. I'm happy to do that, but I'd require this PR to be split into smaller PRs (one per Op?) - this one has grown quite quickly (btw, absolutely no harm in having multiple small PRs). I would definitely extract the canonicalization tests into a separate PR. Importantly, if other reviewers are happy with the content then I'm not going to block it. You already have +1 from @kuhar, who is an expert in this area.

Cases to test

I see three major cases being exercised in this PR (and in general in arith):

Why not split them along these axes (i.e. check 1., 2. and 3.) and either:

Btw,

Agreed. I'm only suggesting to identify the key options/cases and to make sure every test makes it clear what case is being exercised. In my view the coverage proposed here is a good start. But we should prepare these tests for folks who'd like to add more tests at some later time (and make it easy for them).

Running integration tests on different architectures

In your integration tests you don't specify the architecture on which to run the tests. In an ideal world, these should work regardless of the CPU target. Sadly, this is not an ideal world :) A while back, @Lewuathe found a rather unpleasant corner case highlighting a very fine difference in how

I doubt that that ^^^ would impact the tests in this PR, but just to clarify: given that no triple is specified, these tests are intended to be target agnostic, right? That's worth clarifying somewhere.

Final note

As I mentioned above, I won't block this PR. I wanted to share my ideas on the possible approach (I've approached this PR as an RFC). If you agree with my suggestions and decide to follow them, I'm happy to help with reviews. Most importantly, thank you for raising this very important topic, @pingshiyu! Btw, I am OOO next week, so apologies if there's more delay from my side.
mlir/test/Integration/Dialect/Arith/CPU/test-i1-operations.mlir
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Thank you for the comments!

On motivation: some of the existing tests do indeed test existing/historical bugs in MLIR, and others are motivated by bugs found in my independent implementation of MLIR semantics (which would be useful safeguards, since those are bugs that may be tempting to introduce). From these, some extra tests were added for a more complete test suite (if they are very natural to add, e.g.).

On integration vs. unit testing: there may indeed be some duplication here - as @banach-space suggested, I can also audit these against existing unit tests once they are merged in! However, several of the bugs the tests cover are from lowering steps down to LLVM. To test the produced LLVM code, to the best of my knowledge, integration tests seem to be the way (comparing LLVM code directly, rather than executing it, would be a bit fragile imo), but I'd be happy to be corrected by someone who has deeper knowledge about the testing infrastructure.

As for runtime costs of these tests, I did some tests earlier today (personal laptop, default options, all integration tests), and altogether, the new tests add ~1 second to the total running time (263.67s with all new tests, 262.87s without, average taken over 3 runs).
I think this is where we differ: the main way of testing LLVM (and MLIR) is with unit tests checking the lowering through IR matching, not integration tests. These are more exceptional.
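For readers less familiar with that convention, here is a minimal sketch of such an IR-matching lit test. It assumes the standard --convert-arith-to-llvm pass and that arith.extsi maps to llvm.sext; the function name and CHECK lines are illustrative, not taken from this PR.

// RUN: mlir-opt %s --convert-arith-to-llvm | FileCheck %s

// CHECK-LABEL: @extsi_i1_to_i16
// CHECK: llvm.sext %{{.*}} : i1 to i16
func.func @extsi_i1_to_i16(%arg0: i1) -> i16 {
  %0 = arith.extsi %arg0 : i1 to i16
  return %0 : i16
}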
@joker-eph for example, one that I was thinking of was this one: #83248. The bug was from a wrong lowering from Arith down to LLVM. To prepare the test for this bug, we'd ultimately like a test for the functional "spec" of MLIR, containing a desired property: making sure the lowering produces the correct IR and implements the expected behaviour of the operation. Of course, by spec, I don't mean a full formal spec, which doesn't exist, but rather a small aspect of it (here, it'd be the division behaving correctly on inputs that are at the boundary of int representation). Whilst we could validate that the fold does indeed produce the desired IR, it doesn't seem to me that comparing the produced vs. expected IR is a super clear oracle (i.e. the expected IR being the algorithm implementing the fold using a lower-level IR). In this case, I think it is clearer to test the implemented algorithm, so we can check against the spec directly, e.g. by executing it, whether via a constant folder or via the LLVM interpreter - if we are at the LLVM level. Moreover, there may be many valid lowerings for the same op, and such an IR-comparison oracle would also lead to more overhead during development. Overall I think lowerings are where integration tests could provide quite a bit of value!
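(Not the reproducer from that issue - just a minimal sketch of the kind of boundary-value e2e check being described here; the op, the constants, and the expected output are chosen purely for illustration.)

func.func @floordivsi_near_i8_min() {
  %lhs = arith.constant -128 : i8
  %rhs = arith.constant 3 : i8
  %res = arith.floordivsi %lhs, %rhs : i8
  // Floor division rounds toward negative infinity, so this should print -43.
  vector.print %res : i8
  return
}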
Since you're reasoning in terms of "spec": we indeed think in terms of what the spec for floordivsi should be, and then we reason in terms of what the spec of the target of the lowering is. For example, for expand-ops it is the spec of the other arith ops present here. Any kind of reasoning should be able to be done there: if we lower to LLVM, we reason in terms of the spec of the abstract LLVM machine and check that we program the right sequence of instructions as expected.
You're not testing the implementation or the spec, just a particular behavior (the test may be UB but lucky, for example), on some specific host and version of LLVM, against a single input. This is providing less coverage than an IR check from this point of view. More importantly, this is also a question of coupling of the system, an overall maintenance / cost question, and ultimately one of testing philosophy (for which LLVM is historically fairly clear on the preference for IR lit-testing: you won't find any execution unit testing in LLVM like this outside of MLIR).
@joker-eph Perhaps "spec" is the wrong word I used, since we don't have full specs to begin with :) Maybe a better word is "expected behaviour". Comparing IRs directly does test for more "things", in the sense that the implementation is "equal" to some expected implementation. However, in the case linked above, I don't think it is checking against the "thing" we're concerned with. An analogy that comes to mind, especially for the lowering bug mentioned: it is akin to verifying a travelling salesman implementation by checking that the C code matches some expected implementation. Even then, we wouldn't be certain that the expected implementation is correct. A complementary approach is to run the C code on some small examples (ideally using a platform-independent tool that correctly implements C semantics, like CompCert). This approach also results in more comprehensible tests. Similarly, comparing IRs doesn't automatically instil confidence that the top-level IR exhibits the same expected behaviour as the lower-level IR. For instance, it's challenging (at least for me) to look at LLVM bytecode and determine with certainty that it will perform as intended for all inputs. It's interesting to compare against LLVM too, since I think this discussion would only come about in MLIR due to the extensive lowerings.
Right, but that's "separation of concerns". It's what unit testing is all about: we check a component's boundaries based on the expectations at these boundaries.
Right, I understand, but as I mentioned before: we just don't do that historically in how LLVM compiler testing is set up.
I personally don't see a difference between the expand-math in MLIR and the same thing on LLVM IR though.
I agree with Mehdi. We (Arm folks working on SVE and SME support) tend to use e2e tests for rather complex examples (e.g.
From my experience, such tests/bugs are symptoms of insufficient unit test coverage. But rather than using reproducers as is (these tend to be relatively complex), we should try to identify the corresponding edge case that has been missed and test that explicitly (it's much easier to reason about a test once the underlying edge case is clear).
This is reassuring, thanks for checking. Still, we should make sure that we don't encourage folks to add too many new cases. My initial steer was to test all data types, but now I realise that we should avoid that.
Please note that we will never run tests for all possible inputs ;-) Just to summarise, I do find these new e2e tests helpful, but let's limit the scope and focus on a small set of interesting cases. @pingshiyu, what other Ops do you intend to test? Once you finish, we could consolidate all tests into one file. We can use correctness.mlir as an indication (roughly) of what the scope should be.
I do agree that the scope of tests should be limited, but I think where we differ is what the "expectation" should be. I believe it's clearly desirable to test the functionality of the lowering, though. And I don't see how this can be done without something to the effect of an integration test. @banach-space the ones referenced in #100121 would be all the ops I intend to test in this work! |
I don't quite follow the difference you're trying to make.
I feel we're running in circles: we're checking this because the IR you generate has a spec, and you manually check the equivalence by inspecting the IR ("running mentally" is what you wrote). In the meantime, tests like this are out of the scope of what I expect to see here.
We want unit tests to protect against previous bugs, and also to serve as documentation of the expected behaviour. For lowering passes in MLIR, bugs aren't caused by the lowering rewrite not being applied properly, but by the rewrite pattern itself being incorrect. A unit test that applies a pattern and checks the resulting IR doesn't check that the pattern itself is correct. This is what an integration/execution test would do. Going back to the example of
i.e. just describes the pattern itself - and this is, really, the only test that can be written for the lowering pattern. If this is all the testing we have for the lowering, then the issues I see are:
An integration test would resolve the above issues for lowering passes. I'm not proposing to change how LLVM transformations are developed. But for MLIR's lowering passes, there's a real need for integration tests like this - not to replace how unit testing is done for all transformations. There may be a way to do what I've described above as a unit test, in which case please let me know! Also, let me know if I wasn't clear anywhere :) |
Not developed, just tested.
I don't see any difference right now between the example at hand here for MLIR and most transformations in LLVM.
The difference is that for the majority of cases, there is only one unit test that can be written for the lowering pattern: basically the test is the lowering pattern itself (i.e. the example shown). This is a problem for lowering passes only and not for other transformations. Because of this, it results in insufficient testing for the lowering passes imo. |
Why isn't it the case for a canonicalization pattern? Also, what is different between LLVM instruction selection and an MLIR lowering pattern?
You're right, unit tests don't test the correctness of any patterns - but integration tests would. Pattern correctness can also be a problem for canonicalization patterns, I remember quite a few bugs due to incorrect canonicalizers. But lowerings are easier to test in an integration test - it's easy to control when they're applied and they're applied in isolation. OTOH multiple canonicalization patterns may apply and are also applied greedily, so IMO it's harder to isolate them and write integration tests for (correct me if I'm wrong!).
I'm not very familiar with how they're tested so I cannot compare the two, sorry!
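(A rough illustration of the isolation point above, assuming arith-expand as the single lowering pass being exercised; the RUN lines are hypothetical.)

// A single lowering pass can be pinned down and run on its own:
// RUN: mlir-opt %s --arith-expand | FileCheck %s
//
// whereas the greedy canonicalization driver may apply any registered
// pattern to the same input:
// RUN: mlir-opt %s --canonicalize | FileCheck %s

func.func @ceildiv(%a: i32, %b: i32) -> i32 {
  %0 = arith.ceildivsi %a, %b : i32
  return %0 : i32
}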
Thanks for this feedback, Mehdi! I feel that we are missing some guidelines on what the expected scope of integration tests should be. My high-level suggestion (based on this discussion) would be that integration tests should be reserved for cases much more complex than e.g. (extracted from comparison.mlir that Mehdi referred to upthread):
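(As a stand-in for the excerpt, here is a made-up example of a comparably simple check; it is not the actual snippet from comparison.mlir.)

func.func @cmpi_slt_simple() {
  %lhs = arith.constant 1 : i32
  %rhs = arith.constant 2 : i32
  %res = arith.cmpi slt, %lhs, %rhs : i32
  // CHECK: 1
  vector.print %res : i1
  return
}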
Admittedly, that's not particularly involved. However, examples from expand-ops.mlir (including
I am happy to work on a proposal, but will be OoO for the next two weeks (perhaps somebody else is free to volunteer?). In the meantime, we should probably pause this effort.
I don't follow this comment. Unit tests are there specifically to verify that the corresponding pattern is correct. In addition, every pattern is verified through code-review.
IMHO, in the example that you referred to, one "magic" formula for expanding
I am not convinced. I can see how an integration test would prevent us from hitting #83079, but we will never have an e2e test for every possible scenario. That's just not feasible and against the grain of MLIR and LLVM. As for #83079, IMHO, we were (and still are) missing a formal spec for expanding
While it sounds like we probably should revert some of the newly added e2e tests, I feel that overall we are making important progress in terms of test coverage for Arith. Thank you @pingshiyu!
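(For illustration of what such a known-correct reference could look like, here is the textbook identity for signed floor division written in terms of truncating division, expressed as arith ops. This is my own sketch, not the actual expand-ops implementation.)

func.func @floordivsi_reference(%a: i32, %b: i32) -> i32 {
  // Truncating division and its remainder (the remainder takes the sign of %a).
  %q = arith.divsi %a, %b : i32
  %r = arith.remsi %a, %b : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  // Subtract 1 from the truncated quotient iff the remainder is non-zero
  // and the operands have opposite signs.
  %r_nonzero = arith.cmpi ne, %r, %c0 : i32
  %a_neg = arith.cmpi slt, %a, %c0 : i32
  %b_neg = arith.cmpi slt, %b, %c0 : i32
  %signs_differ = arith.cmpi ne, %a_neg, %b_neg : i1
  %adjust_needed = arith.andi %r_nonzero, %signs_differ : i1
  %adjust = arith.select %adjust_needed, %c1, %c0 : i32
  %res = arith.subi %q, %adjust : i32
  return %res : i32
}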
@banach-space thanks for chiming in!
As far as I can see, unit tests don't verify that a pattern is correct: I mean correct in the sense that for a pattern
Oh yes, absolutely! It was never the intention to replace formal semantics with e2e tests :) While working on this PR and the conversation in this thread, I realised that integration tests could fill in some testing gaps in the correctness of rewrite patterns that are not covered by unit tests.
Indeed, the formula (or, the correctness of the lowering pattern) is the part that wasn't able to be tested by a unit test - and an e2e test would be able to provide test coverage for it. In the absence of formal semantics, and for potentially complex lowerings where formal semantics might be absent or arrive slowly, I think e2e tests could, like you said, be complementary to unit tests (they test orthogonal things) :) I am actually currently developing semi-formal semantics in Haskell for some MLIR operations. The integration tests in this suite were all "contentious" cases which were bugs on either the Haskell side during development or the MLIR side - an indication that these cases are "complex" and have the potential to go wrong. The sources of these bugs are mostly unsound optimisation patterns and lowerings - parts which aren't covered by MLIR unit testing.
Thanks for clarifying, now I see what you meant. Still, I think that one of Mehdi's previous comments is quite fitting here:
I will paraphrase - you are proposing a different approach towards testing compared to what's normally done in LLVM/MLIR. We could change that, sure, but it's a discussion beyond Arith and MLIR, and more suitable for LLVM Discourse.
Right, but I don't believe that we should be using MLIR integration tests to verify that a particular mathematical formula is correct. Instead, we should select a formula that is known to be correct (surely there are some good references) and implement that. Testing in MLIR should focus on making sure that the implementation itself is correct, not the formula. Does that make sense? Having said that, IMHO, it would be totally fine to have e2e tests for things like
Exciting! It would be interesting to see how you make sure that the semantics in Haskell match MLIR (and whether there's scope for writing a test suite). That's a completely different thread though 😅
Yes, thanks for highlighting this. I do understand now that this is a deviation from what people are used to. I'm not very familiar with tests on the LLVM side. Maybe we could work together on the integration testing guidelines proposal, as we seem to want changes in similar directions :) What other ways can I contact you to discuss this further? I am on Discord as well and my email is on my profile. I have concerns about raising this with the whole LLVM community because it is quite a substantial scope expansion compared to what I had in mind: I thought integration tests would be useful for rewrites in MLIR, and this PR is initiating that for Arith. There's a focus on lowering patterns here as they are easier to test using integration tests (hard to control what canonicalization patterns get applied) - and the extensive use of lowerings is characteristic of MLIR rather than LLVM.
I don't quite agree with this in the context of general testing approaches for MLIR. We can't expect to always have a well-known or verified formula for all rewrite patterns. Even then, all rewrite patterns are implementations of some formula - and implementations can have bugs, which would benefit from validation using integration tests (since bugs in the patterns cannot be detected by unit tests). (Note: by validation I mean a unit-testing-like approach rather than an exhaustive test suite for all cases; AFAIK patterns don't have any unit testing at all currently, and the question of whether they are correct, or correctly implement some formula, hinges almost entirely upon code reviews.)
It's funny you mention this! xD The way I am making sure the semantics match is by concurrently developing a fuzzer to generate MLIR programs (e.g. lots of random integration tests) and comparing the semantics against the MLIR compilation results. The very test suite in this PR actually comes from this process: every test case here is a bug in either MLIR or in the Haskell semantics - a place where they disagreed at some point during development :) Finally, for this particular series of PRs (integration tests for Arith), I think the following are the value add:
But we should probably return to this once we get some feedback from the wider MLIR community on testing approaches :)
Just a couple of clarifications.
What I had in mind in my earlier post was merely updating the MLIR docs to clarify what the expected scope of e2e tests should be. I think that you want to expand the scope of such tests. I am not convinced that that's the right approach.
I was referring specifically to
Adds some guidelines on what qualifies as a good candidate for an e2e test and in which cases to avoid integration tests. Follow-up from this discussion: * llvm/llvm-project#92272
@joker-eph and @pingshiyu, sorry for the delay.
Please take a look: llvm/mlir-www#203. Once the docs are updated, we should review the newly added e2e tests and remove cases already tested through other means.
Adds some guidelines on what qualifies as a good candidate for an e2e test and in which cases to avoid integration tests. Follow-up from this discussion: * llvm/llvm-project#92272 --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
I am removing the recently added integration test for various Arith Ops. These operations and their lowerings are effectively already verified by the Arith-to-LLVM conversion tests in: * "mlir/test/Conversion/ArithToLLVM/arith-to-llvm.mlir" I've noticed that a few variants of `arith.cmpi` were missing in that file - those are added here as well. This is a follow-up for this discussion: * #92272 See also the recent update to our guidelines on e2e tests in MLIR: * llvm/mlir-www#203
I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations. It found a bunch of bugs in the arith folders, optimisations and lowering.

However, during the development process, I ran into a lot more bugs within my own interpreter, which resulted in a fair number of regression tests for the interpreter. I thought it might be a good idea to add these to the MLIR codebase - they could guard against potential bugs that may be introduced in the future, for example if someone made the same mistakes as I did :p.

The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the Integration folder. However, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code). I've added a few ways these can be adapted in the files of the initial commit, which includes just a few of the unit tests.

Please do let me know if adding these would be helpful, and what's the preferred format for them.

Once/if that's all agreed upon I can add the rest of the regression suite!