
Conversation

@ellert
Contributor

@ellert ellert commented Feb 28, 2023

This Pull request:

The testRooAbsL test compares two doubles and fails due to rounding errors.
The failure happens on ppc64le and aarch64.

Changes or fixes:

This PR changes EXPECT_EQ to EXPECT_DOUBLE_EQ in the affected test.

[ RUN      ] SimBinnedConstrainedTest.SubEventSections
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:293: Failure
Expected equality of these values:
  whole.Sum()
    Which is: -1263.796050661927
  N_events_total_parts.Sum()
    Which is: -1263.7960506619268
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:303: Failure
Expected equality of these values:
  whole.Sum()
    Which is: -1263.796050661927
  thrice_N_events_total_parts.Sum()
    Which is: -1263.7960506619268
[  FAILED  ] SimBinnedConstrainedTest.SubEventSections (199 ms)
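
For illustration, the change is roughly of this shape; the snippet paraphrases the assertions at testRooAbsL.cxx:293 and :303 from the log above rather than quoting the actual diff:

// before: requires the two sums to be bitwise identical
EXPECT_EQ(whole.Sum(), N_events_total_parts.Sum());
// after: tolerates a difference of a few ULPs (EXPECT_DOUBLE_EQ allows 4)
EXPECT_DOUBLE_EQ(whole.Sum(), N_events_total_parts.Sum());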

Checklist:

  • [x] tested changes locally
  • [ ] updated the docs (if necessary)

@phsft-bot

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@guitargeek
Contributor

guitargeek commented Feb 28, 2023

I assigned this to @egpbos who wrote the test. Let's wait for his review before merging. I'm curious to hear his opinion, but as far as I know the EXPECT_EQ is intended here because the results are expected to be identical even bitwise. But maybe that is unreasonable to achieve and EXPECT_DOUBLE_EQ is sufficient?

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2023-02-28T11:18:54.997Z] ghprbPullLongDescription=# This Pull request: …

@guitargeek guitargeek changed the title The testRooAbsL test compares two doubles and fails due to rounding errors [RF] The testRooAbsL test compares two doubles and fails due to rounding errors Mar 8, 2023
@egpbos
Contributor

egpbos commented Apr 12, 2023

Sorry for the late reply!

I would like to understand exactly why the tests fail. We're dealing with summation errors like these all the time in RooFit. Imho, we need to understand as precisely as possible where they come from and then document that. If we cannot do that, then we should treat them as errors until we understand the fundamental limits, e.g. from hardware summation implementations. I can well imagine that that is the root cause here, but I would like to see some hard confirmation of that.

The reason I'm so insistent on this is that I've spent the past year trying to unravel subtle bugs involving multiple summations that may fail because of implementation issues (as was the case for KahanSum, for example), because of fundamental limits of the algorithms (Kahan summation still has residual errors), because of fundamental hardware limits, or because of single vs double vs long double precision, etcetera. Trying to pin down exactly what is going on is a huge pain.

Those are my 2 cents here, sorry for being a pain in the ass :)

@egpbos
Contributor

egpbos commented Apr 12, 2023

Btw, one way I've "documented" one of these fundamental limits is by hardcoding some "random" numbers that failed and then testing those with EXPECT_DOUBLE_EQ (and maybe also EXPECT_NE to be even more precise), while you could also hardcode another random number that does match under EXPECT_EQ. It's very tedious, but I think it makes our tests actually useful for tracing bugs in the future.
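
A rough sketch of what that could look like (the test and fixture names are invented; the two literals are the sums from the failure output in the PR description):

#include <gtest/gtest.h>

// Hypothetical test that documents a known rounding limit.
TEST(DocumentRoundingLimit, SubEventSectionsSums)
{
   double whole = -1263.796050661927;   // sum over the whole event range
   double parts = -1263.7960506619268;  // sum over the sub event sections
   EXPECT_DOUBLE_EQ(whole, parts);      // equal within a few ULPs...
   EXPECT_NE(whole, parts);             // ...but deliberately documented as not bitwise equal
}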

@hahnjo
Member

hahnjo commented May 8, 2023

Hi, I was sent here from #12784 and I'm a bit worried that we have, for multiple months now, a test that is known to fail on non-x86 architectures.

My question would be: Why do you want bit-wise identical results for sums? EXPECT_DOUBLE_EQ doesn't allow arbitrarily large deviations, but only 4 ULPs, which is enough to account for rounding in (well-behaved) algorithms while still catching most other problems.
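
As a rough illustration of what that tolerance means here, this small standalone sketch (not part of the test suite) counts how many representable doubles lie between the two sums from the failure log; for two finite doubles of the same sign this is their ULP distance:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Number of representable doubles between a and b, assuming they have the same sign.
static std::uint64_t ulpDistance(double a, double b)
{
   std::uint64_t ia, ib;
   std::memcpy(&ia, &a, sizeof(double));
   std::memcpy(&ib, &b, sizeof(double));
   return ia > ib ? ia - ib : ib - ia;
}

int main()
{
   // The two sums printed by the failing SubEventSections test.
   double whole = -1263.796050661927;
   double parts = -1263.7960506619268;
   std::printf("ULP distance: %llu\n", (unsigned long long)ulpDistance(whole, parts));
   // EXPECT_DOUBLE_EQ passes for distances up to 4 ULPs; EXPECT_EQ only for 0.
   return 0;
}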

As to where these come from, my primary suspect would be fused-multiply-add instructions or other optimized floating point instructions that are not fully IEEE-754 compliant. I don't know exactly where, but I also don't think that you want to disable them globally and pay the performance penalty...
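
A minimal demonstration of that effect, with an illustrative value that has nothing to do with RooFit: a fused multiply-add rounds once, while a separate multiply followed by an add rounds twice, so the two can disagree in the last bits.

#include <cmath>
#include <cstdio>

int main()
{
   double x = 1.0 / 3.0;
   volatile double prod = x * x;          // the product, rounded to double once
   double separate = prod - prod;         // exactly 0.0
   double fused = std::fma(x, x, -prod);  // the rounding error of x * x, not 0.0
   std::printf("separate = %.17g\nfused    = %.17g\n", separate, fused);
   return 0;
}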

@egpbos
Contributor

egpbos commented May 8, 2023

I agree that it is an unfortunate circumstance to have the test not passing. However, I still urge you to consider the effect that removing strict tests will have on debuggability. I have wasted a lot of time trying to track down a subtle bug because loose tests left me unable to determine where the bug came from. Conversely, by setting tests strictly and tracing discrepancies, I found a different bug in the Kahan sum: #11940. Unit tests especially should be as strict as possible. In the absence of unit tests (which unfortunately is the case for the parts of RooFit I worked on), strict "integration" tests like these are the closest I could get.

Note also that it was simply part of my original assignment: to ensure users can trust the new parallelized methods, I built them to be bit-wise exactly equal to the old methods. So, I know from experience how tedious it is to trace these bit-wise differences down to their actual concrete source, but I also think that because we are dealing with scientific software, precision, robustness, reliability and interpretability (of the components and hence the whole) are important. Of course, performance is an important factor too, but it's just one of the aspects. We want the end results of experiments to be accurate and explainable, right? I personally don't like having to sell a "because of floating point errors", because I've been bitten by them a few times now.

Now, I'm also well aware that the project only has a given amount of resources and I myself am currently more or less out of resources (I can spend only my free time), so my vote probably doesn't count strongly :) If I could be of more practical help in this, I would definitely be happy to, but I also don't have a non-x86 machine, so it's not feasible for me to do much right now. So, feel free to make a call on this as you all best see fit!

@egpbos
Contributor

egpbos commented May 8, 2023

Btw, one additional thought on this accuracy/performance trade-off (also following earlier discussions in the PPP channel): perhaps a good way to "solve" this is by not hardcoding one trade-off choice, but by allowing the users to choose what's best for their particular problems. I could imagine users would like a fast-mode for iteration on designing RooFit models and a precision-mode for the final calculation, for instance.

@hahnjo
Member

hahnjo commented May 9, 2023

This puts us in the unfortunate spot that the solution which is widely used across most other software packages dealing with this kind of situation (i.e., allowing the rounding errors) is not acceptable, but at the same time nobody is willing to spend the (considerable amount of) time to find another solution that you would be happy with.

/cc @Axel-Naumann

@egpbos
Contributor

egpbos commented May 9, 2023

Yep. And again, I completely realize that the reasonable thing for me to do in this case would be to spend the time to make my own wishes come true, but I don't have the resources right now unfortunately.

Another pragmatic, but kinda ugly solution could be to have a compile flag in there that switches between EXPECT_EQ and EXPECT_DOUBLE_EQ based on the architecture. Then everybody can at least get on with things and maybe someone will come along later to fix things.
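
For concreteness, such a switch could look roughly like the following; the macro name is invented here and this is not an existing ROOT facility, just a sketch of the idea:

// Hypothetical helper for the test, selected per architecture.
#if defined(__x86_64__) || defined(_M_X64)
// On x86-64 the sums currently come out bitwise identical, so keep the strict check.
#define EXPECT_SUM_EQ(a, b) EXPECT_EQ(a, b)
#else
// Elsewhere (e.g. AArch64, PPC64LE) allow EXPECT_DOUBLE_EQ's 4-ULP tolerance.
#define EXPECT_SUM_EQ(a, b) EXPECT_DOUBLE_EQ(a, b)
#endif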

Anyway, indeed, maybe Axel can decide :)

@Axel-Naumann
Member

Axel-Naumann commented May 9, 2023

I cannot decide, that's up to you @egpbos and @lmoneta given the input by @hahnjo and me here. Please consider this:

IIUC you are interpreting this lack of bitwise equality as a bug. In that case our approach is to fix the failure asap. If "asap" isn't happening for whatever reason, then we disable the test and create a bug report about the test failure, reminding ourselves (you, @egpbos, in this case) that the test needs to be fixed and re-enabled. "Disabling the test" can as well be a switch from equality to equality-with-range, as long as the GitHub issue makes clear which commit needs to be reverted to reproduce the original test failure.

How does that sound?

@egpbos
Contributor

egpbos commented May 9, 2023

Sounds good to me, let's do that. We can discuss in the next RooFit dev meeting how to handle this long term.

@hahnjo
Member

hahnjo commented May 9, 2023

As to where these come from, my primary suspect would be fused-multiply-add instructions or other optimized floating point instructions that are not fully IEEE-754 compliant. I don't know exactly where, but I also don't think that you want to disable them globally and pay the performance penalty...

Yep: At least for the first difference in testRooAbsL that I hunted down, mac13arm has a fused instruction in RooAbsPdf::extendedTerm. If instead of

double extra = doOffset
? (expected - sumEntries) - sumEntries * (std::log(expected) - std::log(sumEntries))
: expected - sumEntries * std::log(expected);

I put

  printf("        expected = %.17g, log = %.17g\n", expected, std::log(expected));
  double test = sumEntries * std::log(expected);
  printf("        test  = %.17g\n", test);
  test = expected - test;
  printf("        test  = %.17g\n", test);
#if 0
  double extra = doOffset
      ? (expected - sumEntries) - sumEntries * (std::log(expected) - std::log(sumEntries))
      : expected - sumEntries * std::log(expected);
#endif
  double extra = expected - sumEntries * std::log(expected);
  printf("        extra = %.17g\n", extra);

I get the following output:

        test  = 2390.5943542960845
        test  = -1990.5943542960845
        extra = -1990.5943542960847

with the corresponding assembly code being (with some annotations; the fmsub is at ca518)

   ca4d0: 00 41 60 1e   fmov    d0, d8
   ca4d4: c9 1c 07 94   bl      0x2917f8 <_write+0x2917f8>                      # likely the call to std::log
   ca4d8: 0b 40 60 1e   fmov    d11, d0
   ca4dc: e0 07 00 fd   str     d0, [sp, #8]
   ca4e0: e8 03 00 fd   str     d8, [sp]
   ca4e4: 20 10 00 f0   adrp    x0, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x6e8>
   ca4e8: 00 78 36 91   add     x0, x0, #3486
   ca4ec: ff 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "expected = %.17g, log = %.17g"
   ca4f0: 6c 09 6a 1e   fmul    d12, d11, d10                                   # test = sumEntries * std::log(expected)
   ca4f4: ec 03 00 fd   str     d12, [sp]
   ca4f8: 33 10 00 f0   adrp    x19, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x6fc>
   ca4fc: 73 16 37 91   add     x19, x19, #3525
   ca500: e0 03 13 aa   mov     x0, x19
   ca504: f9 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "test  = %.17g"
   ca508: 00 39 6c 1e   fsub    d0, d8, d12                                     # test = expected - test;
   ca50c: e0 03 00 fd   str     d0, [sp]
   ca510: e0 03 13 aa   mov     x0, x19
   ca514: f5 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "test  = %.17g"
   ca518: 4b a1 4b 1f   fmsub   d11, d10, d11, d8                               # extra = expected - sumEntries * std::log(expected)
   ca51c: eb 03 00 fd   str     d11, [sp]
   ca520: 20 10 00 f0   adrp    x0, 0x2d1000 <RooAbsPdf::createNLL(RooAbsData&, RooLinkedList const&)+0x724>
   ca524: 00 70 37 91   add     x0, x0, #3548
   ca528: f0 1c 07 94   bl      0x2918e8 <_write+0x2918e8>                      # print "extra = %.17g"
   ca52c: 28 21 60 1e   fcmp    d9, #0.0
   ca530: 60 00 00 54   b.eq    0xca53c <RooAbsPdf::extendedTerm(double, double, double, bool) const+0x29c>
   ca534: 20 19 6a 1e   fdiv    d0, d9, d10                                     # this is the branch sumEntriesW2 != 0.0
   ca538: 6b 09 60 1e   fmul    d11, d11, d0                                    # with a division and multiplication
   ca53c: 60 41 60 1e   fmov    d0, d11                                         # store the return value into d0
   ca540: fd 7b 47 a9   ldp     x29, x30, [sp, #112]
   ca544: f4 4f 46 a9   ldp     x20, x19, [sp, #96]
   ca548: f6 57 45 a9   ldp     x22, x21, [sp, #80]
   ca54c: e9 23 44 6d   ldp     d9, d8, [sp, #64]
   ca550: eb 2b 43 6d   ldp     d11, d10, [sp, #48]
   ca554: ed 33 42 6d   ldp     d13, d12, [sp, #32]
   ca558: ff 03 02 91   add     sp, sp, #128
   ca55c: c0 03 5f d6   ret

There are other problems with expecting bit-wise identical output across platforms. For example, std::exp and std::log can return different results for certain input parameters, depending on how they are implemented and how they round the result.

@egpbos
Contributor

egpbos commented May 11, 2023

@hahnjo hm that's an interesting clue, but then I still don't fully understand why RooFit::TestStatistics::RooUnbinnedL would give a different answer, because it also just calls RooAbsPdf::extendedTerm, so it should still get the exact same result...

Or is the problem that the FMA operation on the different parts of the sum (the likelihood is calculated over multiple event ranges in the test that is failing, which are summed afterwards) has higher rounding errors on non-x86 so that in the end the result indeed differs?

Or, even more problematically, could such FMA operations also cause similar rounding errors on x86 so that EXPECT_EQ indeed becomes impossible (and I just got lucky with the current parameters on x86)?

@hahnjo
Member

hahnjo commented May 12, 2023

@hahnjo hm that's an interesting clue, but then I still don't fully understand why RooFit::TestStatistics::RooUnbinnedL would give a different answer, because it also just calls RooAbsPdf::extendedTerm, so it should still get the exact same result...

Yes, you are absolutely right, I got carried away by finding the first / one of the differences by comparing x86 to mac13arm. Looking some more into why RooUnbinnedL gives two different results on the same platform, it is actually closely related: in RooUnbinnedL::evaluatePartition, the extendedTerm is added to the first partition, if events.begin_fraction == 0. With one big partition, this is added at the very end, just before returning. For the sub event sections, it is added to the first partition and the other partitions are later added on top. This exhibits different rounding, which is triggered by mac13arm using the FMA instructions and therefore having a different extendedTerm than x86.
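
The ordering effect itself is easy to reproduce in isolation. A tiny sketch with made-up numbers (unrelated to the actual likelihood values): the same three partial results plus an "extended term" are combined in two different groupings, and because floating-point addition is not associative the results differ.

#include <cstdio>

int main()
{
   double extendedTerm = 0.1;
   double part1 = 1e16, part2 = -1e16, part3 = 3.0;

   // Term folded into the first partition, remaining partitions added on top.
   double first = ((part1 + extendedTerm) + part2) + part3;
   // Term folded into the last partition instead.
   double last = (part1 + part2) + (part3 + extendedTerm);

   std::printf("first = %.17g\nlast  = %.17g\n", first, last);
   return 0;
}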

One solution to this would be to add the extendedTerm to the last partition instead, that is, if events.end_fraction == 1. This fixes SimBinnedConstrainedTest.SubEventSections on mac13arm, but makes it fail on x86 (I didn't investigate why). It also makes SimBinnedConstrainedTest.EventSections fail (both on mac13arm and x86). That test looks even worse in terms of floating point arithmetic, since it expects that summing up two partitions individually and then adding the results is bitwise identical to summing up the whole range at once. From what I can see, this is only working by chance right now.

Or is the problem that the FMA operation on the different parts of the sum (the likelihood is calculated over multiple event ranges in the test that is failing, which are summed afterwards) has higher rounding errors on non-x86 so that in the end the result indeed differs?

See above; the additional problem is that the (optimizing) compiler will insert FMA operations if it sees fit (and is allowed to do so). So depending on how the code and the arithmetic operations are formulated, you may get different results depending on platform, software versions, optimization level, hardware, etc.

Or, even more problematically, could such FMA operations also cause similar rounding errors on x86 so that EXPECT_EQ indeed becomes impossible (and I just got lucky with the current parameters on x86)?

On x86, you are saved by the fact that, as far as I can tell, FMA instructions only come from some SIMD extension that the compiler cannot use unconditionally. This is different on AArch64 (and apparently also PPC64LE) where it appears it is in the base ISA and always available. But yes, if in the future FMA is always available on x86 or you compile (parts of) the code with vector instructions (RooBatchCompute, hint hint), you can potentially run into the same problem. For another point regarding "lucky", see above that adding the extendedTerm to the last partition already makes one other test fail...

In conclusion, I urge you to reconsider expecting bit-wise identical results. I think I've provided plenty of evidence why it's not a good idea, I don't know what more to say...

@hahnjo
Member

hahnjo commented May 16, 2023

@egpbos @guitargeek is this ok to merge then?

@egpbos
Contributor

egpbos commented May 17, 2023

You convinced me that the subevent-section tests should be EXPECT_DOUBLE_EQ.

From what I can see, this is only working by chance right now.

You are right! This can easily be confirmed (even just on my x86 laptop) by changing the seed on line 66. With some other values, SimBinnedConstrainedTest.EventSections fails (e.g. seed = 24), and so does SimBinnedConstrainedTest.SubEventSections (e.g. seed = 255). Some seeds even make multiple tests fail: 25534 fails BinnedDatasetTest.EventSections and SimBinnedConstrainedTest.EventSections, and 2 makes both SimBinnedConstrainedTest event-splitting tests fail.

I was not able to find a seed that fails the other tests, so it seems to be contained to the *EventSections tests.

To be complete, I think we should add an explanation of why this comparison doesn't work, which is that the calculations are, in fact, different. We're comparing a single-section ("all events") calculation to a multi-section one ("events section 1" + "section 2" + ...). Obviously, this can give differences due to rounding from FMA, or simply from Kahan summation, which also has a small error sometimes.
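
To make the Kahan part concrete, here is a minimal textbook Kahan sum (not RooFit's KahanSum class, and with made-up values): one pass over all values and a per-section pass whose partial sums are then added need not agree bitwise, because the per-section compensation terms are dropped when the sections are combined with a plain addition.

#include <cstdio>
#include <vector>

// Textbook Kahan-compensated summation, for illustration only.
static double kahanSum(const std::vector<double> &v)
{
   double sum = 0.0, c = 0.0;
   for (double x : v) {
      double y = x - c;
      double t = sum + y;
      c = (t - sum) - y;
      sum = t;
   }
   return sum;
}

int main()
{
   // Values chosen so that the two small terms only matter when they meet.
   double whole = kahanSum({1.0, 6e-17, 6e-17});
   double sections = kahanSum({1.0, 6e-17}) + kahanSum({6e-17});
   // With IEEE-754 doubles (and no -ffast-math) these differ in the last bit.
   std::printf("whole    = %.17g\nsections = %.17g\n", whole, sections);
   return 0;
}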

In other tests in the RooAbsL and other RooFit::TestStatistics suite, this doesn't typically apply. In most of the tests, we're trying to do the exact same calculation ("same" in terms of CPU operations as well), except either in parallel or through the TestStatistics classes instead of the old RooNLLVar tree. Sometimes different ordering can change results, like the extended term you mentioned, but also the subsidiary term.

Anyway, for this PR: I will suggest an explanation comment and after that it can be merged (@guitargeek agrees, we discussed it in the RooFit dev meeting just now). I will also open an issue reminding us to also apply this change to the two EventSections tests.

@guitargeek guitargeek force-pushed the roofit-double-equal-check-fail branch from d6a11ac to 3e0e63a Compare May 22, 2023 17:41
@phsft-bot

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

Contributor

@guitargeek guitargeek left a comment


Thanks for the very thorough discussion! I amended the comment suggested by @egpbos in this PR, and will merge it once the CI passes.

@github-actions

Test Results

6 files, 6 suites, 1d 13h 30m 24s ⏱️
2 428 tests: 2 396 ✔️, 0 💤, 32 failed
14 397 runs: 14 365 ✔️, 0 💤, 32 failed

For more details on these failures, see this check.

Results for commit 3e0e63a.

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Failing tests:

@guitargeek guitargeek merged commit b70fcdb into root-project:master May 22, 2023
@ellert ellert deleted the roofit-double-equal-check-fail branch May 23, 2023 01:41
@hahnjo hahnjo mentioned this pull request May 30, 2023