[RF] The testRooAbsL test compares two doubles and fails due to rounding errors #12389
Conversation
I assigned this to @egpbos who wrote the test. Let's wait for his review before merging. I'm curious to hear his opinion, but as far as I know the …
Build failed on windows10/cxx14. Errors: …
Sorry for the late reply! I would like to understand why exactly the tests fail. We're dealing with summation errors like these all the time in RooFit. Imho, we need to understand as exactly as possible where they come from and then document that. If we cannot do that, then we should treat them as errors until we understand the fundamental limits, e.g. from hardware summation implementations. I can well imagine that that is the root cause here, but I would like to see some hard confirmation of it.

The reason I'm so insistent on this is that I've spent the past year trying to unravel subtle bugs involving multiple summations that may fail because of implementation issues (as was e.g. the case for KahanSum), because of fundamental limits in algorithms (Kahan summation also still has errors), or because of fundamental hardware limits, single vs double vs long double precision, etcetera. Trying to pin down exactly what is going on is a huge pain. Those are my 2 cents here, sorry for being a pain in the ass :)
Btw, one way I've "documented" one of these fundamental limits is by hardcoding some "random" numbers that failed and then testing those with EXPECT_DOUBLE_EQ (and maybe also EXPECT_NE to be even more precise), while you could also hardcode another random number that does match under EXPECT_EQ. It's very tedious, but I think it makes our tests actually useful for tracing bugs in the future.
Hi, I was sent here from #12784 and I'm a bit worried that we have, for multiple months now, a test that is known to fail on non-x86 architectures. My question would be: Why do you want bit-wise identical results for sums? As to where these come from, my primary suspect would be fused-multiply-add instructions or other optimized floating point instructions that are not fully IEEE-754 compliant. I don't know exactly where, but I also don't think that you want to disable them globally and pay the performance penalty...
I agree that it is an unfortunate circumstance to have the test not passing. However, I still urge you to consider the effect that removing strict tests will have on debuggability. I have wasted a lot of time trying to track down a subtle bug due to loose tests which left me unable to determine where the bug came from. Conversely, by setting tests strictly and tracing discrepancies, I found a different bug in the Kahan sum: #11940. Unit tests especially should be as strict as possible. In the absence of unit tests (which unfortunately is the case for the parts of RooFit I worked on), strict "integration" tests like these are the closest I could get.

Note also that it was simply part of my original assignment: to ensure users can trust the new parallelized methods, I built them to be bit-wise exactly equal to the old methods. So, I know from experience how tedious it is to trace these bit-wise differences down to their actual concrete source, but I also think that because we are dealing with scientific software, precision, robustness, reliability and interpretability (of components and hence of the whole) are important. Of course, performance is an important factor too, but it's just one of the aspects. We want the end results of experiments to be accurate and explainable, right? I personally at least don't like it when I have to sell a "because of floating point errors", because I've been bitten by them a few times now.

Now, I'm also well aware that the project only has a given amount of resources and I myself am currently more or less out of resources (I can spend only my free time), so my vote probably doesn't count strongly :) If I could be of more practical help in this, I would definitely be happy to, but I also don't have a non-x86 machine, so it's not feasible for me to do much right now. So, feel free to make a call on this as you all best see fit!
Btw, one additional thought on this accuracy/performance trade-off (also following earlier discussions in the PPP channel): perhaps a good way to "solve" this is by not hardcoding one trade-off choice, but by allowing the users to choose what's best for their particular problems. I could imagine users would like a fast-mode for iteration on designing RooFit models and a precision-mode for the final calculation, for instance. |
This puts us into the unfortunate spot that the solution which is widely used across most other software packages dealing with this kind of situation (i.e., allowing the rounding errors) is not acceptable, but at the same time nobody is willing to spend the (considerable amount of) time to find another solution that you would be happy with. /cc @Axel-Naumann
Yep. And again, I completely realize that the reasonable thing for me to do in this case would be to spend the time to make my own wishes come true, but I don't have the resources right now unfortunately. Another pragmatic, but kinda ugly solution could be to have a compile flag in there that switches between EXPECT_EQ and EXPECT_DOUBLE_EQ based on the architecture. Then everybody can at least get on with things and maybe someone will come along later to fix things. Anyway, indeed, maybe Axel can decide :) |
I cannot decide, that's up to you @egpbos and @lmoneta given the input by @hahnjo and me here. Please consider this: IIUC you are interpreting this lack of bitwise equality as a bug. In that case our approach is to fix the failure asap. If "asap" isn't happening because reasons then we disable the test, and create a bug report about the test failure, reminding ourselves (you, @egpbos in this case) that the test needs to be fixed and re-enabled. "Disabling the test" can as well be a switch from equality to equality-with-range, as long as the GitHub issue is clear which commit needs to be reverted to reproduce the original test failure. How does that sound?
Sounds good to me, let's do that. We can discuss in the next RooFit dev meeting how to handle this long term. |
Yep. At least for the first difference: in root/roofit/roofitcore/src/RooAbsPdf.cxx (lines 820 to 822 at b7b8646) I put

```cpp
printf(" expected = %.17g, log = %.17g\n", expected, std::log(expected));
double test = sumEntries * std::log(expected);
printf(" test = %.17g\n", test);
test = expected - test;
printf(" test = %.17g\n", test);
#if 0
double extra = doOffset
   ? (expected - sumEntries) - sumEntries * (std::log(expected) - std::log(sumEntries))
   : expected - sumEntries * std::log(expected);
#endif
double extra = expected - sumEntries * std::log(expected);
printf(" extra = %.17g\n", extra);
```

I get the following output: … with the corresponding assembly code being (with some annotations): …

There are other problems with expecting bit-wise identical output across platforms. For example …
@hahnjo hm, that's an interesting clue, but then I still don't fully understand why …

Or is the problem that the FMA operation on the different parts of the sum (the likelihood is calculated over multiple event ranges in the test that is failing, which are summed afterwards) has higher rounding errors on non-x86, so that in the end the result indeed differs? Or, even more problematically, could such FMA operations also cause similar rounding errors on x86, so that EXPECT_EQ indeed becomes impossible (and I just got lucky with the current parameters on x86)?
Yes, you are absolutely right, I got carried away by finding the first / one of the differences by comparing x86 to mac13arm. Looking some more at why …

One solution to this would be to add the …
See above; the additional problem is that the (optimizing) compiler will insert FMA operations if it sees fit (and is allowed to do so). So depending on how the code and the arithmetic operations are formulated, you may get different results depending on platform, software versions, optimization level, hardware, etc.
On x86, you are saved by the fact that, as far as I can tell, FMA instructions only come from some SIMD extension that the compiler cannot use unconditionally. This is different on AArch64 (and apparently also PPC64LE), where it appears it is in the base ISA and always available. But yes, if in the future FMA is always available on x86, or you compile (parts of) the code with vector instructions (…

In conclusion, I urge you to reconsider expecting bit-wise identical results. I think I've provided plenty of evidence why it's not a good idea; I don't know what more to say...
@egpbos @guitargeek is this ok to merge then? |
You convinced me that the subevent-section tests should be …
You are right! This can easily be confirmed (even just on my x86 laptop) by changing the seed on line 66. Setting it to some other values, the …

I was not able to find a seed that fails the other tests, so it seems to be contained to the …

To be complete, I think we should add the explanation of why this comparison doesn't work, which is that the calculations are, in fact, different. We're comparing a single-section ("all events") calculation to a multi-section ("events section 1" + "section 2" + ...) one. Obviously, this can give differences due to rounding from FMA, or just from Kahan summation, which also has a small error sometimes. In other tests in the …

Anyway, for this PR: I will suggest an explanation comment and after that it can be merged (@guitargeek agrees, we discussed it in the RooFit dev meeting just now). I will also open an issue reminding us to also apply this change to the two …
The failure happens on ppc64le and aarch64.
This PR changes EXPECT_EQ to EXPECT_DOUBLE_EQ in the affected test.
[ RUN ] SimBinnedConstrainedTest.SubEventSections
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:293: Failure
Expected equality of these values:
whole.Sum()
Which is: -1263.796050661927
N_events_total_parts.Sum()
Which is: -1263.7960506619268
/builddir/build/BUILD/root-6.28.00/roofit/roofitcore/test/TestStatistics/testRooAbsL.cxx:303: Failure
Expected equality of these values:
whole.Sum()
Which is: -1263.796050661927
thrice_N_events_total_parts.Sum()
Which is: -1263.7960506619268
[ FAILED ] SimBinnedConstrainedTest.SubEventSections (199 ms)
guitargeek left a comment:
Thanks for the very thorough discussion! I amended the comment suggested by @egpbos in this PR, and will merge it once the CI passes.
Build failed on mac12/noimt. Failing tests: …

Build failed on windows10/cxx14. Failing tests: …