
Conversation

@ellert
Contributor

@ellert ellert commented Feb 28, 2023

This Pull request:

The testRooAbsL test compares two doubles and fails due to rounding errors.
The failure happens on ppc64le and aarch64.

Changes or fixes:

This PR changes EXPECT_EQ to EXPECT_DOUBLE_EQ in the affected test.

[ RUN      ] SimBinnedConstrainedTest.SubEventSections
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:293: Failure
Expected equality of these values:
  whole.Sum()
    Which is: -1263.796050661927
  N_events_total_parts.Sum()
    Which is: -1263.7960506619268
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:303: Failure
Expected equality of these values:
  whole.Sum()
    Which is: -1263.796050661927
  thrice_N_events_total_parts.Sum()
    Which is: -1263.7960506619268
[  FAILED  ] SimBinnedConstrainedTest.SubEventSections (199 ms)
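
For illustration, the change is roughly of this shape; the snippet paraphrases the assertions at testRooAbsL.cxx:293 and :303 from the log above rather than quoting the actual diff:

// before: requires the two sums to be bitwise identical
EXPECT_EQ(whole.Sum(), N_events_total_parts.Sum());
// after: tolerates a difference of a few ULPs (EXPECT_DOUBLE_EQ allows 4)
EXPECT_DOUBLE_EQ(whole.Sum(), N_events_total_parts.Sum());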

Checklist:

  • [x] tested changes locally
  • [ ] updated the docs (if necessary)

@phsft-bot

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@guitargeek
Contributor

guitargeek commented Feb 28, 2023

I assigned this to @egpbos who wrote the test. Let's wait for his review before merging. I'm curious to hear his opinion, but as far as I know the EXPECT_EQ is intended here because the results are expected to be identical even bitwise. But maybe that is unreasonable to achieve and EXPECT_DOUBLE_EQ is sufficient?

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2023-02-28T11:18:54.997Z] ghprbPullLongDescription=# This Pull request: …

@guitargeek guitargeek changed the title The testRooAbsL test compares two doubles and fails due to rounding errors [RF] The testRooAbsL test compares two doubles and fails due to rounding errors Mar 8, 2023
@egpbos
Contributor

egpbos commented Apr 12, 2023

Sorry for the late reply!

I would like to understand exactly why the tests fail. We're dealing with summation errors like these all the time in RooFit. Imho, we need to understand as precisely as possible where they come from and then document that. If we cannot do that, then we should treat them as errors until we understand the fundamental limits, e.g. from hardware summation implementations. I can well imagine that that is the root cause here, but I would like to see some hard confirmation of that.

The reason I'm so insistent on this is that I've spent the past year trying to unravel subtle bugs involving multiple summations that may fail because of implementation issues (as was the case for KahanSum, for example), because of fundamental limits of the algorithms (Kahan summation still has residual errors), because of fundamental hardware limits, or because of single vs double vs long double precision, etcetera. Trying to pin down exactly what is going on is a huge pain.

Those are my 2 cents here, sorry for being a pain in the ass :)

@egpbos
Contributor

egpbos commented Apr 12, 2023

Btw, one way I've "documented" one of these fundamental limits is by hardcoding some "random" numbers that failed and then testing those with EXPECT_DOUBLE_EQ (and maybe also EXPECT_NE to be even more precise), while you could also hardcode another random number that does match under EXPECT_EQ. It's very tedious, but I think it makes our tests actually useful for tracing bugs in the future.
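
A rough sketch of what that could look like (the test and fixture names are invented; the two literals are the sums from the failure output in the PR description):

#include <gtest/gtest.h>

// Hypothetical test that documents a known rounding limit.
TEST(DocumentRoundingLimit, SubEventSectionsSums)
{
   double whole = -1263.796050661927;   // sum over the whole event range
   double parts = -1263.7960506619268;  // sum over the sub event sections
   EXPECT_DOUBLE_EQ(whole, parts);      // equal within a few ULPs...
   EXPECT_NE(whole, parts);             // ...but deliberately documented as not bitwise equal
}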

@hahnjo
Member

hahnjo commented May 8, 2023

Hi, I was sent here from #12784 and I'm a bit worried that we have, for multiple months now, a test that is known to fail on non-x86 architectures.

My question would be: Why do you want bit-wise identical results for sums? EXPECT_DOUBLE_EQ doesn't allow arbitrarily large deviations, but only 4 ULPs, which is enough to account for rounding in (well-behaved) algorithms while still catching most other problems.
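
As a rough illustration of what that tolerance means here, this small standalone sketch (not part of the test suite) counts how many representable doubles lie between the two sums from the failure log; for two finite doubles of the same sign this is their ULP distance:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Number of representable doubles between a and b, assuming they have the same sign.
static std::uint64_t ulpDistance(double a, double b)
{
   std::uint64_t ia, ib;
   std::memcpy(&ia, &a, sizeof(double));
   std::memcpy(&ib, &b, sizeof(double));
   return ia > ib ? ia - ib : ib - ia;
}

int main()
{
   // The two sums printed by the failing SubEventSections test.
   double whole = -1263.796050661927;
   double parts = -1263.7960506619268;
   std::printf("ULP distance: %llu\n", (unsigned long long)ulpDistance(whole, parts));
   // EXPECT_DOUBLE_EQ passes for distances up to 4 ULPs; EXPECT_EQ only for 0.
   return 0;
}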

As to where these come from, my primary suspect would be fused-multiply-add instructions or other optimized floating point instructions that are not fully IEEE-754 compliant. I don't know exactly where, but I also don't think that you want to disable them globally and pay the performance penalty...
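
A minimal demonstration of that effect, with an illustrative value that has nothing to do with RooFit: a fused multiply-add rounds once, while a separate multiply followed by an add rounds twice, so the two can disagree in the last bits.

#include <cmath>
#include <cstdio>

int main()
{
   double x = 1.0 / 3.0;
   volatile double prod = x * x;          // the product, rounded to double once
   double separate = prod - prod;         // exactly 0.0
   double fused = std::fma(x, x, -prod);  // the rounding error of x * x, not 0.0
   std::printf("separate = %.17g\nfused    = %.17g\n", separate, fused);
   return 0;
}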

@egpbos
Contributor

egpbos commented May 8, 2023

I agree that it is an unfortunate circumstance to have the test not passing. However, I still urge you to consider the effect that removing strict tests will have on debuggability. I have wasted a lot of time trying to track down a subtle bug because loose tests left me unable to determine where the bug came from. Conversely, by setting tests strictly and tracing discrepancies, I found a different bug in the Kahan sum: #11940. Unit tests especially should be as strict as possible. In the absence of unit tests (which unfortunately is the case for the parts of RooFit I worked on), strict "integration" tests like these are the closest I could get.

Note also that it was simply part of my original assignment: to ensure users can trust the new parallelized methods, I built them to be bit-wise exactly equal to the old methods. So, I know from experience how tedious it is to trace these bit-wise differences down to their actual concrete source, but I also think that because we are dealing with scientific software, precision, robustness, reliability and interpretability (of the components and hence the whole) are important. Of course, performance is an important factor too, but it's just one of the aspects. We want the end results of experiments to be accurate and explainable, right? I personally don't like having to sell a "because of floating point errors", because I've been bitten by them a few times now.

Now, I'm also well aware that the project only has a given amount of resources and I myself am currently more or less out of resources (I can spend only my free time), so my vote probably doesn't count strongly :) If I could be of more practical help in this, I would definitely be happy to, but I also don't have a non-x86 machine, so it's not feasible for me to do much right now. So, feel free to make a call on this as you all best see fit!

@egpbos
Contributor

egpbos commented May 8, 2023

Btw, one additional thought on this accuracy/performance trade-off (also following earlier discussions in the PPP channel): perhaps a good way to "solve" this is by not hardcoding one trade-off choice, but by allowing the users to choose what's best for their particular problems. I could imagine users would like a fast-mode for iteration on designing RooFit models and a precision-mode for the final calculation, for instance.

@hahnjo
Member

hahnjo commented May 9, 2023

This puts us in the unfortunate spot that the solution which is widely used across most other software packages dealing with this kind of situation (i.e., allowing the rounding errors) is not acceptable, but at the same time nobody is willing to spend the (considerable amount of) time to find another solution that you would be happy with.

/cc @Axel-Naumann

@egpbos
Contributor

egpbos commented May 9, 2023

Yep. And again, I completely realize that the reasonable thing for me to do in this case would be to spend the time to make my own wishes come true, but I don't have the resources right now unfortunately.

Another pragmatic, but kinda ugly solution could be to have a compile flag in there that switches between EXPECT_EQ and EXPECT_DOUBLE_EQ based on the architecture. Then everybody can at least get on with things and maybe someone will come along later to fix things.
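
For concreteness, such a switch could look roughly like the following; the macro name is invented here and this is not an existing ROOT facility, just a sketch of the idea:

// Hypothetical helper for the test, selected per architecture.
#if defined(__x86_64__) || defined(_M_X64)
// On x86-64 the sums currently come out bitwise identical, so keep the strict check.
#define EXPECT_SUM_EQ(a, b) EXPECT_EQ(a, b)
#else
// Elsewhere (e.g. AArch64, PPC64LE) allow EXPECT_DOUBLE_EQ's 4-ULP tolerance.
#define EXPECT_SUM_EQ(a, b) EXPECT_DOUBLE_EQ(a, b)
#endif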

Anyway, indeed, maybe Axel can decide :)

@Axel-Naumann
Member

Axel-Naumann commented May 9, 2023

I cannot decide, that's up to you @egpbos and @lmoneta given the input by @hahnjo and me here. Please consider this:

IIUC you are interpreting this lack of bitwise equality as a bug. In that case our approach is to fix the failure asap. If "asap" isn't happening for whatever reason, then we disable the test and create a bug report about the test failure, reminding ourselves (you, @egpbos, in this case) that the test needs to be fixed and re-enabled. "Disabling the test" can as well be a switch from equality to equality-with-range, as long as the GitHub issue makes clear which commit needs to be reverted to reproduce the original test failure.

How does that sound?

@egpbos
Contributor

egpbos commented May 9, 2023

Sounds good to me, let's do that. We can discuss in the next RooFit dev meeting how to handle this long term.

@hahnjo
Member

hahnjo commented May 9, 2023

As to where these come from, my primary suspect would be fused-multiply-add instructions or other optimized floating point instructions that are not fully IEEE-754 compliant. I don't know exactly where, but I also don't think that you want to disable them globally and pay the performance penalty...

Yep: At least for the first difference in testRooAbsL that I hunted down, mac13arm has a fused instruction in RooAbsPdf::extendedTerm. If instead of

double extra = doOffset
? (expected - sumEntries) - sumEntries * (std::log(expected) - std::log(sumEntries))
: expected - sumEntries * std::log(expected);

I put

  printf("        expected = %.17g, log = %.17g\n", expected, std::log(expected));
  double test = sumEntries * std::log(expected);
  printf("        test  = %.17g\n", test);
  test = expected - test;
  printf("        test  = %.17g\n", test);
#if 0
  double extra = doOffset
      ? (expected - sumEntries) - sumEntries * (std::log(expected) - std::log(sumEntries))
      : expected - sumEntries * std::log(expected);
#endif
  double extra = expected - sumEntries * std::log(expected);
  printf("        extra = %.17g\n", extra);

I get the following output:

        test  = 2390.5943542960845
        test  = -1990.5943542960845
        extra = -1990.5943542960847

with the corresponding assembly code being (with some annotations; the fmsub is at ca518)

   ca4d0: 00 41 60 1e   fmov    d0, d8
   ca4d4: c9 1c 07 94   bl      0x2917f8 <_write+0x2917f8>                      # likely the call to std::log
   ca4d8: 0b 40 60 1e   fmov    d11, d0
   ca4dc: e0 07 00 fd   str     d0, [sp, #8]
   ca4e0: e8 03 00 fd   str     d8, [sp]
   ca4e4: 20 10 00 f0   adrp    x0, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x6e8>
   ca4e8: 00 78 36 91   add     x0, x0, #3486
   ca4ec: ff 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "expected = %.17g, log = %.17g"
   ca4f0: 6c 09 6a 1e   fmul    d12, d11, d10                                   # test = sumEntries * std::log(expected)
   ca4f4: ec 03 00 fd   str     d12, [sp]
   ca4f8: 33 10 00 f0   adrp    x19, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x6fc>
   ca4fc: 73 16 37 91   add     x19, x19, #3525
   ca500: e0 03 13 aa   mov     x0, x19
   ca504: f9 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "test  = %.17g"
   ca508: 00 39 6c 1e   fsub    d0, d8, d12                                     # test = expected - test;
   ca50c: e0 03 00 fd   str     d0, [sp]
   ca510: e0 03 13 aa   mov     x0, x19
   ca514: f5 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "test  = %.17g"
   ca518: 4b a1 4b 1f   fmsub   d11, d10, d11, d8                               # extra = expected - sumEntries * std::log(expected)
   ca51c: eb 03 00 fd   str     d11, [sp]
   ca520: 20 10 00 f0   adrp    x0, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x724>
   ca524: 00 70 37 91   add     x0, x0, #3548
   ca528: f0 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "extra = %.17g"
   ca52c: 28 21 60 1e   fcmp    d9, #0.0
   ca530: 60 00 00 54   b.eq    0xca53c <RooAbsPdf::extendedTerm(double, double, double, bool) const+0x29c>
   ca534: 20 19 6a 1e   fdiv    d0, d9, d10                                     # this is the branch sumEntriesW2 != 0.0
   ca538: 6b 09 60 1e   fmul    d11, d11, d0                                    # with a division and multiplication
   ca53c: 60 41 60 1e   fmov    d0, d11                                         # store the return value into d0
   ca540: fd 7b 47 a9   ldp     x29, x30, [sp, #112]
   ca544: f4 4f 46 a9   ldp     x20, x19, [sp, #96]
   ca548: f6 57 45 a9   ldp     x22, x21, [sp, #80]
   ca54c: e9 23 44 6d   ldp     d9, d8, [sp, #64]
   ca550: eb 2b 43 6d   ldp     d11, d10, [sp, #48]
   ca554: ed 33 42 6d   ldp     d13, d12, [sp, #32]
   ca558: ff 03 02 91   add     sp, sp, #128
   ca55c: c0 03 5f d6   ret

There are other problems with expecting bit-wise identical output across platforms. For example, std::exp and std::log can return different results for certain input parameters, depending on how they are implemented and how they round the result.

@egpbos
Contributor

egpbos commented May 11, 2023

@hahnjo hm that's an interesting clue, but then I still don't fully understand why RooFit::TestStatistics::RooUnbinnedL would give a different answer, because it also just calls RooAbsPdf::extendedTerm, so it should still get the exact same result...

Or is the problem that the FMA operation on the different parts of the sum (the likelihood is calculated over multiple event ranges in the test that is failing, which are summed afterwards) has higher rounding errors on non-x86 so that in the end the result indeed differs?

Or, even more problematically, could such FMA operations also cause similar rounding errors on x86 so that EXPECT_EQ indeed becomes impossible (and I just got lucky with the current parameters on x86)?

@hahnjo
Member

hahnjo commented May 12, 2023

@hahnjo hm that's an interesting clue, but then I still don't fully understand why RooFit::TestStatistics::RooUnbinnedL would give a different answer, because it also just calls RooAbsPdf::extendedTerm, so it should still get the exact same result...

Yes, you are absolutely right, I got carried away by finding the first / one of the differences by comparing x86 to mac13arm. Looking some more into why RooUnbinnedL gives two different results on the same platform, it is actually closely related: in RooUnbinnedL::evaluatePartition, the extendedTerm is added to the first partition, if events.begin_fraction == 0. With one big partition, this is added at the very end, just before returning. For the sub event sections, it is added to the first partition and the other partitions are later added on top. This exhibits different rounding, which is triggered by mac13arm using the FMA instructions and therefore having a different extendedTerm than x86.
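
The ordering effect itself is easy to reproduce in isolation. A tiny sketch with made-up numbers (unrelated to the actual likelihood values): the same three partial results plus an "extended term" are combined in two different groupings, and because floating-point addition is not associative the results differ.

#include <cstdio>

int main()
{
   double extendedTerm = 0.1;
   double part1 = 1e16, part2 = -1e16, part3 = 3.0;

   // Term folded into the first partition, remaining partitions added on top.
   double first = ((part1 + extendedTerm) + part2) + part3;
   // Term folded into the last partition instead.
   double last = (part1 + part2) + (part3 + extendedTerm);

   std::printf("first = %.17g\nlast  = %.17g\n", first, last);
   return 0;
}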

One solution to this would be to add the extendedTerm to the last partition instead, that is, if events.end_fraction == 1. This fixes SimBinnedConstrainedTest.SubEventSections on mac13arm, but makes it fail on x86 (I didn't investigate why). It also makes SimBinnedConstrainedTest.EventSections fail (both on mac13arm and x86). That test looks even worse in terms of floating point arithmetic, since it expects that summing up two partitions individually and then adding the results is bitwise identical to summing up the whole range at once. From what I can see, this is only working by chance right now.

Or is the problem that the FMA operation on the different parts of the sum (the likelihood is calculated over multiple event ranges in the test that is failing, which are summed afterwards) has higher rounding errors on non-x86 so that in the end the result indeed differs?

See above; the additional problem is that the (optimizing) compiler will insert FMA operations if it sees fit (and is allowed to do so). So depending on how the code and the arithmetic operations are formulated, you may get different results depending on platform, software versions, optimization level, hardware, etc.

Or, even more problematically, could such FMA operations also cause similar rounding errors on x86 so that EXPECT_EQ indeed becomes impossible (and I just got lucky with the current parameters on x86)?

On x86, you are saved by the fact that, as far as I can tell, FMA instructions only come from some SIMD extension that the compiler cannot use unconditionally. This is different on AArch64 (and apparently also PPC64LE) where it appears it is in the base ISA and always available. But yes, if in the future FMA is always available on x86 or you compile (parts of) the code with vector instructions (RooBatchCompute, hint hint), you can potentially run into the same problem. For another point regarding "lucky", see above that adding the extendedTerm to the last partition already makes one other test fail...

In conclusion, I urge you to reconsider expecting bit-wise identical results. I think I've provided plenty of evidence why it's not a good idea, I don't know what more to say...

@hahnjo
Member

hahnjo commented May 16, 2023

@egpbos @guitargeek is this ok to merge then?

@egpbos
Contributor

egpbos commented May 17, 2023

You convinced me that the subevent-section tests should be EXPECT_DOUBLE_EQ.

From what I can see, this is only working by chance right now.

You are right! This can easily be confirmed (even just on my x86 laptop) by changing the seed on line 66. With some other values, SimBinnedConstrainedTest.EventSections fails (e.g. seed = 24), and so does SimBinnedConstrainedTest.SubEventSections (e.g. seed = 255). Some seeds even make multiple tests fail: 25534 fails BinnedDatasetTest.EventSections and SimBinnedConstrainedTest.EventSections, and 2 makes both SimBinnedConstrainedTest event-splitting tests fail.

I was not able to find a seed that fails the other tests, so it seems to be contained to the *EventSections tests.

To be complete, I think we should add an explanation of why this comparison doesn't work, which is that the calculations are, in fact, different. We're comparing a single-section ("all events") calculation to a multi-section one ("events section 1" + "section 2" + ...). Obviously, this can give differences due to rounding from FMA, or simply from Kahan summation, which also has a small error sometimes.
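
To make the Kahan part concrete, here is a minimal textbook Kahan sum (not RooFit's KahanSum class, and with made-up values): one pass over all values and a per-section pass whose partial sums are then added need not agree bitwise, because the per-section compensation terms are dropped when the sections are combined with a plain addition.

#include <cstdio>
#include <vector>

// Textbook Kahan-compensated summation, for illustration only.
static double kahanSum(const std::vector<double> &v)
{
   double sum = 0.0, c = 0.0;
   for (double x : v) {
      double y = x - c;
      double t = sum + y;
      c = (t - sum) - y;
      sum = t;
   }
   return sum;
}

int main()
{
   // Values chosen so that the two small terms only matter when they meet.
   double whole = kahanSum({1.0, 6e-17, 6e-17});
   double sections = kahanSum({1.0, 6e-17}) + kahanSum({6e-17});
   // With IEEE-754 doubles (and no -ffast-math) these differ in the last bit.
   std::printf("whole    = %.17g\nsections = %.17g\n", whole, sections);
   return 0;
}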

In other tests in the RooAbsL and other RooFit::TestStatistics suite, this doesn't typically apply. In most of the tests, we're trying to do the exact same calculation ("same" in terms of CPU operations as well), except either in parallel or through the TestStatistics classes instead of the old RooNLLVar tree. Sometimes different ordering can change results, like the extended term you mentioned, but also the subsidiary term.

Anyway, for this PR: I will suggest an explanation comment and after that it can be merged (@guitargeek agrees, we discussed it in the RooFit dev meeting just now). I will also open an issue reminding us to also apply this change to the two EventSections tests.

@guitargeek guitargeek force-pushed the roofit-double-equal-check-fail branch from d6a11ac to 3e0e63a Compare May 22, 2023 17:41
@phsft-bot

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

Contributor

@guitargeek guitargeek left a comment


Thanks for the very thorough discussion! I amended the comment suggested by @egpbos in this PR, and will merge it once the CI passes.

@github-actions

Test Results

6 files, 6 suites, 1d 13h 30m 24s ⏱️
2 428 tests: 2 396 ✔️, 0 💤, 32 failed
14 397 runs: 14 365 ✔️, 0 💤, 32 failed

For more details on these failures, see this check.

Results for commit 3e0e63a.

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Failing tests:

@guitargeek guitargeek merged commit b70fcdb into root-project:master May 22, 2023
@ellert ellert deleted the roofit-double-equal-check-fail branch May 23, 2023 01:41
@hahnjo hahnjo mentioned this pull request May 30, 2023