[mlir][arith] Add more canonicalization and integration tests coverage #92272
Conversation
@llvm/pr-subscribers-mlir-arith

Author: Jacob Yu (pingshiyu)

Changes: I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations. It found a bunch of bugs in the arith folders, optimisations and lowering. However, during the development process, I ran into a lot more bugs within my own interpreter, which resulted in a fair number of regression tests for the interpreter. I thought it might be a good idea to add these to the MLIR codebase - they could guard against potential bugs that may be introduced in the future, for example if someone made the same mistakes as I did :p. The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the Integration folder; however, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code). I've added a few ways these can be adapted in the files of the initial commit, which includes just a few of the unit tests. Please do let me know if adding these would be helpful, and what's the preferred format for them. Once/if that's all agreed upon I can add the rest of the regression suite!

Full diff: https://github.com/llvm/llvm-project/pull/92272.diff

2 Files Affected:
diff --git a/mlir/test/Dialect/Arith/canonicalize.mlir b/mlir/test/Dialect/Arith/canonicalize.mlir
index e4f95bb0545a2..aa79ad1bead56 100644
--- a/mlir/test/Dialect/Arith/canonicalize.mlir
+++ b/mlir/test/Dialect/Arith/canonicalize.mlir
@@ -3022,6 +3022,36 @@ func.func @minsi_i0() -> i0 {
return %minsi : i0
}
+// CHECK-LABEL: @extsiOnI1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant -1 : i16
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extsiOnI1() -> (i1, i16) {
+ %true = arith.constant -1 : i1
+ %0 = arith.extsi %true : i1 to i16
+ return %true, %0 : i1, i16
+}
+
+// CHECK-LABEL: @extuiOn1I1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant 1 : i64
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extuiOn1I1() -> (i1, i64) {
+ %true = arith.constant true
+ %0 = arith.extui %true : i1 to i64
+ return %true, %0 : i1, i64
+}
+
+// CHECK-LABEL: @trunciI16ToI8
+// CHECK: %[[CST:.*]] = arith.constant 20194 : i16
+// CHECK: %[[CST2:.*]] = arith.constant -30 : i8
+// CHECK: return %[[CST]], %[[CST2]]
+func.func @trunciI16ToI8() -> (i16, i8) {
+ %c20194_i16 = arith.constant 20194 : i16
+ %0 = arith.trunci %c20194_i16 : i16 to i8
+ return %c20194_i16, %0 : i16, i8
+}
+
// CHECK-LABEL: @mulsi_extended_i0
// CHECK: %[[ZERO:.*]] = arith.constant 0 : i0
// CHECK: return %[[ZERO]], %[[ZERO]] : i0
diff --git a/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
new file mode 100644
index 0000000000000..cd1cad4d56da9
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
@@ -0,0 +1,52 @@
+// Regression test suite from an independent development of Arith's
+// semantics and a fuzzer
+
+// RUN: mlir-opt %s --convert-scf-to-cf --convert-cf-to-llvm --convert-vector-to-llvm \
+// RUN: --convert-func-to-llvm --convert-arith-to-llvm | \
+// RUN: mlir-cpu-runner -e entry -entry-point-result=void \
+// RUN: --shared-libs=%mlir_c_runner_utils | \
+// RUN: FileCheck %s --match-full-lines
+
+module {
+ func.func @extsiOnI1() {
+ %true = arith.constant -1 : i1
+ %0 = arith.extsi %true : i1 to i16
+ vector.print %true : i1
+ vector.print %0 : i16
+ return
+ }
+
+ func.func @extuiOn1I1() {
+ %true = arith.constant true
+ %0 = arith.extui %true : i1 to i64
+ vector.print %true : i1
+ vector.print %0 : i64
+ return
+ }
+
+ func.func @trunciI16ToI8() {
+ %c20194_i16 = arith.constant 20194 : i16
+ %0 = arith.trunci %c20194_i16 : i16 to i8
+ vector.print %c20194_i16 : i16
+ vector.print %0 : i8
+ return
+ }
+
+ func.func @entry() {
+
+ // CHECK: 1
+ // CHECK: -1
+ func.call @extsiOnI1() : () -> ()
+
+ // CHECK: 1
+ // CHECK: 1
+ func.call @extuiOn1I1() : () -> ()
+
+ // CHECK: 20194
+ // CHECK: -30
+ func.call @trunciI16ToI8() : () -> ()
+
+
+ return
+ }
+}
\ No newline at end of file
Nice!
Quick question, what kind of bugs did you find in the canonicalization patterns and the lowerings?
Do you have issues/PR for them already?
Also, is your fuzzer/semantics open-source somewhere? It would be interesting for me to see them, to compare them with the one I currently have for arith (https://github.com/opencompl/xdsl-smt/blob/main/xdsl_smt/semantics/arith_semantics.py)
Co-authored-by: Fehr Mathieu <mathieu.fehr@gmail.com>
Yep! The bugs found by this method are here (some are, as of now, unfixed! :o). The fuzzer/semantics are unfortunately not open source at the moment, but I do plan on making them so in the near future! There's a bit more I plan to work on before it's public :)
I want to ask: what is the preferred format for these tests? Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so, I'll add the rest of the tests into the PR too.
mlir/test/Integration/Dialect/Arith/CPU/test-int-trunc-ext.mlir
Historically speaking, we are not relying too much on integration tests. I'm not against them in principle though, as long as they are tiny (fast to execute) and easy to maintain. One important aspect, though, is that our test coverage does not rely on the integration tests: all of our canonicalizations should be unit-tested.
Thanks for sharing these, @pingshiyu! I'm always in favour of improving our test coverage, but if we intend to add much more then it would be good to have some plan. Another question that I'd ask - how far do we go with testing? Do we want to exercise every possible combination? For example, looking at "test-int-trunc-ext.mlir" that you added, there are the following 3 cases:
Why not more, i.e. other data types and operations? Once we consider all the possibilities, the potential set of all tests becomes really large. This can be a maintenance burden. Are we worried? Also, if we are to add many more tests, it would be good to establish some structure so that it's easy to identify cases that are covered and to find cases that are missing. Otherwise, we tend to duplicate tests; I've seen examples in the Vector dialect, not sure about Arith. This can usually be improved with comments and consistent test function naming.
I find integration tests very helpful, but they tend to be expensive and hence hidden behind a CMake flag. Improving the coverage for Arith would be a welcome addition, but it would be good to have a strategy for selecting good test cases. To me, this is a very good reference: IIUC, there's a case for every operation, and every Op is verified with a set of inputs. I know that @Lewuathe has recently contributed there - would you have any recommendation for good cases to test?
Definitely a good point on having the new tests organised in a way that makes it easy to identify gaps! The current tests don't cover all cases, although I think it wouldn't be realistic to do so either - that would be a combinatorial explosion and is perhaps better explored using property-based testing :) For this PR, my aim is to just add all the tests that I have accrued from the interpreter exercise, and most of these were once a bug within the interpreter. In the most recent commit I've added the remainder of the tests as integration tests. I avoided the
I've assigned the tests a rough category for what they're testing, according to the file names added :)
Some good points here! IMO this specific PR falls under the broader umbrella of regression tests -- I think there's a lot of value in handling the corner cases we did not handle properly in the past. We should definitely have LIT test coverage there. In LLVM it used to be much more common to check in tests taken verbatim from bugzilla in individual regression test files. I probably wouldn't go as far, and having the regression tests organized in some logical buckets makes sense to me. Similarly, exhaustively testing every permutation of types / operations is unlikely to add much value on its own, but I wouldn't mind adding more tests if we do discover that our coverage is insufficient and start seeing bugs being reported. As for the integration tests, I think MLIR is really undertested. There's no in-tree frontend compiler (except for flang?) to truly exercise the conversions on real hardware, and I think there's also value in having integration tests for conversions that may be tricky to get right (e.g., wide integer emulation) or easy to miscompile by the backend compiler. I think these extension tests fit in nicely here.
Thank you all for replying to my comments!

1. Wider discussion on testing

I would prefer to avoid that. Instead, for every bug I'd rather identify the actual edge case that's:

In general, I don't quite follow the split into "regression" and regular tests (I also feel that folks interpret the split differently). From https://llvm.org/docs/TestingGuide.html#regression-tests

To me that sounds like a test for a specific (edge) case that "could go wrong". For completeness, that same document defines "unit test" as:

But let's not diverge too much - we all agree that we need more testing and patches like this one 😅 Btw ...

We do have e2e tests for Linalg (and Vector and SparseTensor) that target SVE and SME. These tests are run by LLVM buildbots (there are more tests for other targets too, but I don't track those as closely). Glad to hear that people find these valuable :)

2. This patch

Test files

Now, looking at the files added in this PR:

I see two conflicting splits:

For consistency, could you select one and stick with that? My personal preference would be one file per operation (or group of similar operations), and then exercise the "most interesting" data types in every file. Also, IMHO you can safely skip "test" in file names that are located in ... a "test" sub-directory (to reduce noise). That's a nit ;-)

Review process - multiple PRs

Given that you are testing a lot of edge cases, I'd like to go over every one of them and make sure they agree with my expectations. I am raising this especially in the context of the following PR, in which we discovered that our folding logic for

As in, I think it's good to review every case for such tests. I'm happy to do that, but I'd require this PR to be split into smaller PRs (one per Op?) - this one has grown quite quickly (btw, absolutely no harm in having multiple small PRs). I would definitely extract the canonicalization tests into a separate PR. Importantly, if other reviewers are happy with the content then I'm not going to block it. You already have +1 from @kuhar, who is an expert in this area.

Cases to test

I see three major cases being exercised in this PR (and in general in arith):

Why not split them along these axes (i.e. check 1., 2. and 3.) and either:

Btw,

Agreed. I'm only suggesting to identify the key options/cases and to make sure every test makes it clear what case is being exercised. In my view the coverage proposed here is a good start. But we should prepare these tests for folks who'd like to add more tests at some later time (and make it easy for them).

Running integration tests on different architectures

In your integration tests you don't specify the architecture on which to run the tests. In an ideal world, these should work regardless of the CPU target. Sadly, this is not an ideal world :) A while back, @Lewuathe found a rather unpleasant corner case highlighting a very fine difference in how

I doubt that that ^^^ would impact the tests in this PR, but just to clarify: given that no triple is specified, these tests are intended to be target agnostic, right? That's worth clarifying somewhere.

Final note

As I mentioned above, I won't block this PR. I wanted to share my ideas on the possible approach (I've approached this PR as an RFC). If you agree with my suggestions and decide to follow them, I'm happy to help with reviews. Most importantly, thank you for raising this very important topic, @pingshiyu! Btw, I am OOO next week, so apologies if there's more delay from my side.
mlir/test/Integration/Dialect/Arith/CPU/test-i1-operations.mlir
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Thank you for the comments!

On motivation: some of the existing tests do indeed test existing/historical bugs in MLIR, and others are motivated by bugs found in my independent implementation of MLIR semantics (which would be useful safeguards, since those are bugs that may be tempting to introduce). From these, some extra tests were added for a more complete test suite (if they are very natural to add, e.g.).

On integration vs. unit testing: there may indeed be some duplication here - as @banach-space suggested, I can also audit these against existing unit tests once they are merged in! However, several of the bugs the tests cover are from lowering steps down to LLVM. To test the produced LLVM code, to the best of my knowledge, integration tests seem to be the way (comparing LLVM code directly, rather than executing it, would be a bit fragile imo), but I'd be happy to be corrected by someone who has deeper knowledge about the testing infrastructure.

As for runtime costs of these tests, I did some tests earlier today (personal laptop, default options, all integration tests), and altogether, the new tests add ~1 second to the total running time (263.67s with all new tests, 262.87s without, average taken over 3 runs).
I think this is where we differ: the main way of testing LLVM (and MLIR) is with unit tests checking the lowering through IR matching, not integration tests. These are more exceptional.
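For readers less familiar with that convention, here is a minimal sketch of such an IR-matching lit test. It assumes the standard --convert-arith-to-llvm pass and that arith.extsi maps to llvm.sext; the function name and CHECK lines are illustrative, not taken from this PR.

// RUN: mlir-opt %s --convert-arith-to-llvm | FileCheck %s

// CHECK-LABEL: @extsi_i1_to_i16
// CHECK: llvm.sext %{{.*}} : i1 to i16
func.func @extsi_i1_to_i16(%arg0: i1) -> i16 {
  %0 = arith.extsi %arg0 : i1 to i16
  return %0 : i16
}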
@joker-eph for example, one that I was thinking of was this one: #83248. The bug was from a wrong lowering from Arith down to LLVM. To prepare the test for this bug, we'd ultimately like a test for the functional "spec" of MLIR, containing a desired property: making sure the lowering produces the correct IR and implements the expected behaviour of the operation. Of course, by spec, I don't mean a full formal spec, which doesn't exist, but rather a small aspect of it (here, it'd be the division behaving correctly on inputs that are at the boundary of int representation). Whilst we could validate that the fold does indeed produce the desired IR, it doesn't seem to me that comparing the produced vs. expected IR is a super clear oracle (i.e. the expected IR being the algorithm implementing the fold using a lower-level IR). In this case, I think it is clearer to test the implemented algorithm, so we can check against the spec directly, e.g. by executing it, whether via a constant folder or via the LLVM interpreter - if we are at the LLVM level. Moreover, there may be many valid lowerings for the same op, and such an IR-comparison oracle would also lead to more overhead during development. Overall I think lowerings are where integration tests could provide quite a bit of value!
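(Not the reproducer from that issue - just a minimal sketch of the kind of boundary-value e2e check being described here; the op, the constants, and the expected output are chosen purely for illustration.)

func.func @floordivsi_near_i8_min() {
  %lhs = arith.constant -128 : i8
  %rhs = arith.constant 3 : i8
  %res = arith.floordivsi %lhs, %rhs : i8
  // Floor division rounds toward negative infinity, so this should print -43.
  vector.print %res : i8
  return
}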
Since you're reasoning in terms of "spec": we indeed think in terms of what the spec for floordivsi should be, and then we reason in terms of what the spec of the target of the lowering is. For example, for expand-ops it is the spec of the other arith ops present here. Any kind of reasoning should be able to be done there: if we lower to LLVM, we reason in terms of the spec of the abstract LLVM machine and check that we program the right sequence of instructions as expected.
You're not testing the implementation or the spec, just a particular behavior (the test may be UB but lucky, for example), on some specific host and version of LLVM, against a single input. This is providing less coverage than an IR check from this point of view. More importantly, this is also a question of coupling of the system, an overall maintenance / cost question, and ultimately one of testing philosophy (for which LLVM is historically fairly clear on the preference for IR lit-testing: you won't find any execution unit testing in LLVM like this outside of MLIR).
@joker-eph Perhaps "spec" is the wrong word I used, since we don't have full specs to begin with :) Maybe a better word is "expected behaviour". Comparing IRs directly does test for more "things", in the sense that the implementation is "equal" to some expected implementation. However, in the case linked above, I don't think it is checking against the "thing" we're concerned with. An analogy that comes to mind, especially for the lowering bug mentioned: it is akin to verifying a travelling salesman implementation by checking that the C code matches some expected implementation. Even then, we wouldn't be certain that the expected implementation is correct. A complementary approach is to run the C code on some small examples (ideally using a platform-independent tool that correctly implements C semantics, like CompCert). This approach also results in more comprehensible tests. Similarly, comparing IRs doesn't automatically instil confidence that the top-level IR exhibits the same expected behaviour as the lower-level IR. For instance, it's challenging (at least for me) to look at LLVM bytecode and determine with certainty that it will perform as intended for all inputs. It's interesting to compare against LLVM too, since I think this discussion would only come about in MLIR due to the extensive lowerings.
Right, but that's "separation of concerns". It's what unit testing is all about: we check a component's boundaries based on the expectations at these boundaries.
Right, I understand, but as I mentioned before: we just don't do that historically in how LLVM compiler testing is set up.
I personally don't see a difference between the expand-math in MLIR and the same thing on LLVM IR though.
I agree with Mehdi. We (Arm folks working on SVE and SME support) tend to use e2e tests for rather complex examples (e.g.
From my experience, such tests/bugs are symptoms of insufficient unit test coverage. But rather than using reproducers as is (these tend to be relatively complex), we should try to identify the corresponding edge case that has been missed and test that explicitly (it's much easier to reason about a test once the underlying edge case is clear).
This is reassuring, thanks for checking. Still, we should make sure that we don't encourage folks to add too many new cases. My initial steer was to test all data types, but now I realise that we should avoid that.
Please note that we will never run tests for all possible inputs ;-) Just to summarise, I do find these new e2e tests helpful, but let's limit the scope and focus on a small set of interesting cases. @pingshiyu, what other Ops do you intend to test? Once you finish, we could consolidate all tests into one file. We can use correctness.mlir as an indication (roughly) of what the scope should be.
I do agree that the scope of tests should be limited, but I think where we differ is what the "expectation" should be. I believe it's clearly desirable to test the functionality of the lowering, though. And I don't see how this can be done without something to the effect of an integration test. @banach-space the ones referenced in #100121 would be all the ops I intend to test in this work! |
I don't quite follow the difference you're trying to make.
I feel we're running in circles: we're checking this because the IR you generate has a spec, and you manually check the equivalence by inspecting the IR ("running mentally" is what you wrote). In the meantime, tests like this are out of the scope of what I expect to see here.
We want unit tests to protect against previous bugs, and also to serve as documentation of the expected behaviour. For lowering passes in MLIR, bugs aren't caused by the lowering rewrite not being applied properly, but by the rewrite pattern itself being incorrect. A unit test that applies a pattern and checks the resulting IR doesn't check that the pattern itself is correct. This is what an integration/execution test would do. Going back to the example of
i.e. just describes the pattern itself - and this is, really, the only test that can be written for the lowering pattern. If this is all the testing we have for the lowering, then the issues I see are:
An integration test would resolve the above issues for lowering passes. I'm not proposing to change how LLVM transformations are developed. But for MLIR's lowering passes, there's a real need for integration tests like this - not to replace how unit testing is done for all transformations. There may be a way to do what I've described above as a unit test, in which case please let me know! Also, let me know if I wasn't clear anywhere :) |
Not developed, just tested.
I don't see any difference right now between the example at hand here for MLIR and most transformations in LLVM.
The difference is that for the majority of cases, there is only one unit test that can be written for the lowering pattern: basically the test is the lowering pattern itself (i.e. the example shown). This is a problem for lowering passes only and not for other transformations. Because of this, it results in insufficient testing for the lowering passes imo. |
Why isn't it the case for a canonicalization pattern? Also, what is different between LLVM instruction selection and an MLIR lowering pattern?
You're right, unit tests don't test the correctness of any patterns - but integration tests would. Pattern correctness can also be a problem for canonicalization patterns, I remember quite a few bugs due to incorrect canonicalizers. But lowerings are easier to test in an integration test - it's easy to control when they're applied and they're applied in isolation. OTOH multiple canonicalization patterns may apply and are also applied greedily, so IMO it's harder to isolate them and write integration tests for (correct me if I'm wrong!).
I'm not very familiar with how they're tested so I cannot compare the two, sorry!
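(A rough illustration of the isolation point above, assuming arith-expand as the single lowering pass being exercised; the RUN lines are hypothetical.)

// A single lowering pass can be pinned down and run on its own:
// RUN: mlir-opt %s --arith-expand | FileCheck %s
//
// whereas the greedy canonicalization driver may apply any registered
// pattern to the same input:
// RUN: mlir-opt %s --canonicalize | FileCheck %s

func.func @ceildiv(%a: i32, %b: i32) -> i32 {
  %0 = arith.ceildivsi %a, %b : i32
  return %0 : i32
}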
Thanks for this feedback, Mehdi! I feel that we are missing some guidelines on what the expected scope of integration tests should be. My high-level suggestion (based on this discussion) would be that integration tests should be reserved for cases much more complex than e.g. (extracted from comparison.mlir that Mehdi referred to upthread):
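(As a stand-in for the excerpt, here is a made-up example of a comparably simple check; it is not the actual snippet from comparison.mlir.)

func.func @cmpi_slt_simple() {
  %lhs = arith.constant 1 : i32
  %rhs = arith.constant 2 : i32
  %res = arith.cmpi slt, %lhs, %rhs : i32
  // CHECK: 1
  vector.print %res : i1
  return
}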
Admittedly, that's not particularly involved. However, examples from expand-ops.mlir (including
I am happy to work on a proposal, but will be OoO for the next two weeks (perhaps somebody else is free to volunteer?). In the meantime, we should probably pause this effort.
I don't follow this comment. Unit tests are there specifically to verify that the corresponding pattern is correct. In addition, every pattern is verified through code-review.
IMHO, in the example that you referred to, one "magic" formula for expanding
I am not convinced. I can see how an integration test would prevent us from hitting #83079, but we will never have an e2e test for every possible scenario. That's just not feasible and against the grain of MLIR and LLVM. As for #83079, IMHO, we were (and still are) missing a formal spec for expanding
While it sounds like we probably should revert some of the newly added e2e tests, I feel that overall we are making important progress in terms of test coverage for Arith. Thank you @pingshiyu!
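(For illustration of what such a known-correct reference could look like, here is the textbook identity for signed floor division written in terms of truncating division, expressed as arith ops. This is my own sketch, not the actual expand-ops implementation.)

func.func @floordivsi_reference(%a: i32, %b: i32) -> i32 {
  // Truncating division and its remainder (the remainder takes the sign of %a).
  %q = arith.divsi %a, %b : i32
  %r = arith.remsi %a, %b : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  // Subtract 1 from the truncated quotient iff the remainder is non-zero
  // and the operands have opposite signs.
  %r_nonzero = arith.cmpi ne, %r, %c0 : i32
  %a_neg = arith.cmpi slt, %a, %c0 : i32
  %b_neg = arith.cmpi slt, %b, %c0 : i32
  %signs_differ = arith.cmpi ne, %a_neg, %b_neg : i1
  %adjust_needed = arith.andi %r_nonzero, %signs_differ : i1
  %adjust = arith.select %adjust_needed, %c1, %c0 : i32
  %res = arith.subi %q, %adjust : i32
  return %res : i32
}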
@banach-space thanks for chiming in!
As far as I can see, unit tests don't verify that a pattern is correct: I mean correct in the sense that for a pattern
Oh yes, absolutely! It was never the intention to replace formal semantics with e2e tests :) While working on this PR and the conversation in this thread, I realised that integration tests could fill in some testing gaps in the correctness of rewrite patterns that are not covered by unit tests.
Indeed, the formula (or, the correctness of the lowering pattern) is the part that wasn't able to be tested by a unit test - and an e2e test would be able to provide test coverage for it. In the absence of formal semantics, and for potentially complex lowerings where formal semantics might be absent or arrive slowly, I think e2e tests could, like you said, be complementary to unit tests (they test orthogonal things) :) I am actually currently developing semi-formal semantics in Haskell for some MLIR operations. The integration tests in this suite were all "contentious" cases which were bugs on either the Haskell side during development or the MLIR side - an indication that these cases are "complex" and have the potential to go wrong. The sources of these bugs are mostly unsound optimisation patterns and lowerings - parts which aren't covered by MLIR unit testing.
Thanks for clarifying, now I see what you meant. Still, I think that one of Mehdi's previous comments is quite fitting here:
I will paraphrase - you are proposing a different approach towards testing compared to what's normally done in LLVM/MLIR. We could change that, sure, but it's a discussion beyond Arith and MLIR, and more suitable for LLVM Discourse.
Right, but I don't believe that we should be using MLIR integration tests to verify that a particular mathematical formula is correct. Instead, we should select a formula that is known to be correct (surely there are some good references) and implement that. Testing in MLIR should focus on making sure that the implementation itself is correct, not the formula. Does that make sense? Having said that, IMHO, it would be totally fine to have e2e tests for things like
Exciting! It would be interesting to see how you make sure that the semantics in Haskell match MLIR (and whether there's scope for writing a test suite). That's a completely different thread though 😅
Yes, thanks for highlighting this. I do understand now that this is a deviation from what people are used to. I'm not very familiar with tests on the LLVM side. Maybe we could work together on the integration testing guidelines proposal, as we seem to want changes in similar directions :) What other ways can I contact you to discuss this further? I am on Discord as well and my email is on my profile. I have concerns about raising this with the whole LLVM community because it is quite a substantial scope expansion compared to what I had in mind: I thought integration tests would be useful for rewrites in MLIR, and this PR is initiating that for Arith. There's a focus on lowering patterns here as they are easier to test using integration tests (hard to control what canonicalization patterns get applied) - and the extensive use of lowerings is characteristic of MLIR rather than LLVM.
I don't quite agree with this in the context of general testing approaches for MLIR. We can't expect to always have a well-known or verified formula for all rewrite patterns. Even then, all rewrite patterns are implementations of some formula - and implementations can have bugs, which would benefit from validation using integration tests (since bugs in the patterns cannot be detected by unit tests). (Note: by validation I mean a unit-testing-like approach rather than an exhaustive test suite for all cases; AFAIK patterns don't have any unit testing at all currently, and the question of whether they are correct, or correctly implement some formula, hinges almost entirely upon code reviews.)
It's funny you mention this! xD The way I am making sure the semantics match is by concurrently developing a fuzzer to generate MLIR programs (e.g. lots of random integration tests) and comparing the semantics against the MLIR compilation results. The very test suite in this PR actually comes from this process: every test case here is a bug in either MLIR or in the Haskell semantics - a place where they disagreed at some point during development :) Finally, for this particular series of PRs (integration tests for Arith), I think the following are the value add:
But we should probably return to this once we get some feedback from the wider MLIR community on testing approaches :)
Just a couple of clarifications.
What I had in mind in my earlier post was merely updating the MLIR docs to clarify what the expected scope of e2e tests should be. I think that you want to expand the scope of such tests. I am not convinced that that's the right approach.
I was referring specifically to
Adds some guidelines on what qualifies as a good candidate for an e2e test and in which cases to avoid integration tests. Follow-up from this discussion: * llvm/llvm-project#92272
@joker-eph and @pingshiyu, sorry for the delay.
Please take a look: llvm/mlir-www#203. Once the docs are updated, we should review the newly added e2e tests and remove cases already tested through other means.
Adds some guidelines on what qualifies as a good candidate for an e2e test and in which cases to avoid integration tests. Follow-up from this discussion: * llvm/llvm-project#92272 --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
I am removing the recently added integration test for various Arith Ops. These operations and their lowerings are effectively already verified by the Arith-to-LLVM conversion tests in: * "mlir/test/Conversion/ArithToLLVM/arith-to-llvm.mlir" I've noticed that a few variants of `arith.cmpi` were missing in that file - those are added here as well. This is a follow-up for this discussion: * #92272 See also the recent update to our guidelines on e2e tests in MLIR: * llvm/mlir-www#203
I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations. It found a bunch of bugs in the arith folders, optimisations and lowering.

However, during the development process, I ran into a lot more bugs within my own interpreter, which resulted in a fair number of regression tests for the interpreter. I thought it might be a good idea to add these to the MLIR codebase - they could guard against potential bugs that may be introduced in the future, for example if someone made the same mistakes as I did :p.

The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the Integration folder. However, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code). I've added a few ways these can be adapted in the files of the initial commit, which includes just a few of the unit tests.

Please do let me know if adding these would be helpful, and what's the preferred format for them.

Once/if that's all agreed upon I can add the rest of the regression suite!