Detecting when propensities overflow #43
Conversation
arrow/obsidian.c (Outdated)

    }
  }
  printf("largest reaction is %d at %f\n", max_reaction, propensities[max_reaction]);
  total = 0.0;
It could be more helpful for debugging if this returned the NaN (note the capitalization) so the caller or the Python wrapper can check for that and raise an exception, instead of printing to stdout without context. Also, dropping the NaN loses the information about which kind of NaN it was.

BTW, it might be possible to avoid the overflow sometimes by using extended-precision `long double` math and/or rearranging the order of operations to interleave multiplication and division.
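For illustration, a minimal sketch in C of what that interleaving could look like for an n-choose-k style product (not the repo's actual choose code), with `long double` for the extra headroom it gives on platforms where it is 80-bit extended precision:

```c
/* Sketch: compute C(n, k) by interleaving multiplication and division so
 * the running value stays near the final magnitude instead of building a
 * huge numerator first; long double adds headroom where it is 80-bit. */
static double choose_interleaved(long n, long k) {
    long double result = 1.0L;
    for (long i = 0; i < k; i++) {
        result *= (long double)(n - i);
        result /= (long double)(i + 1);
    }
    return (double) result;
}
```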
Hey @1fish2, thanks for this! It is a good question how to best handle this kind of failure.
> It could be more helpful for debugging if this returned the NaN

The NaN arises in an intermediate value (`total`) which is not returned, but instead guides the rest of the computation. `StochasticSystem` returns a dictionary of five values:
- steps
- time
- events
- occurrences
- outcome
The current consideration is that the failure could have happened many steps into the simulation, which produced reasonable results up until that failure. Do we throw those results away? An argument could be made for that; for now I am returning the useful parts of the results in case they are of use to the client.
The main concern you raise that I think needs to be solved is that we want to communicate the failure back to the client in a way they can respond to programmatically, rather than the current "triage" message, which is more of a suggestion to fix the offending reaction and run it again. For this we could add another key to the result, like `status` perhaps, that describes the result, kind of like a bash exit code (0 for success, other numbers for failures). This would take more work to thread that result back through the C/Cython machinery, but could be worth it if there were some meaningful action the client could take during the run of the simulation. Maybe just aborting would be most correct, but we could leave that up to the client.

Currently I don't see demand for that feature (we are going to be using it to identify the offending reaction(s) and decompose the stoichiometry into subreactions), but for completeness/correctness' sake it makes sense to provide it.
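As a sketch of how a client might consume such a key (everything here is hypothetical, including the key name, the codes, and the example values):

```python
def check_status(result):
    """Raise if the run reported a failure via the proposed 'status' key
    (0 = success, nonzero = failure such as a propensity overflow)."""
    if result.get('status', 0) != 0:
        raise RuntimeError('simulation aborted with status %d after %d steps'
                           % (result['status'], result['steps']))

# Illustrative result dictionary with the proposed extra key.
check_status({'steps': 512, 'time': 3.7, 'events': [], 'occurrences': [],
              'outcome': None, 'status': 0})
```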
> BTW, it might be possible to avoid the overflow sometimes by using extended-precision `long double` math and/or rearranging the order of operations to interleave multiplication and division.

We are currently using `float64` for the propensity calculations. I don't see a higher-bit data type there, besides `complex128`, which just uses two `float64` values for its two components: https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html
Are you referring to the 80-bit extended double precision? https://en.wikipedia.org/wiki/Extended_precision
I saw some references to 128-bit quadruple precision values, but they seem to be compiler dependent.
> and/or rearranging the order of operations to interleave multiplication and division.
This overflow is happening in a loop while adding together thousands of large numbers. @tahorst suggested projecting to log space, which is an interesting approach to look into.
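A minimal sketch of the log-space idea, using the standard `lgamma` from math.h (not current repo code): carry propensities as logs and exponentiate, or compare, only at the end.

```c
#include <math.h>

/* Sketch: ln C(n, k) via lgamma never overflows for counts like these,
 * even though C(n, k) itself does in a 64-bit double. */
static double log_choose(double n, double k) {
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

/* A propensity could then be carried as log(rate) plus the log-combinations
 * and only exponentiated once it is known to be in range. */
```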
Indeed, maybe this is a low priority. In that case let's file an Issue for later. If this printf message were buried in the logs of 100 sims, I'd be lucky to notice it and would have no clue how to track down what went wrong, where, or which output values and downstream generations were GIGO. Cython can raise an exception, and that'd fix it.

I'm under the impression that `long double` in C is an 80-bit extended-precision float.

An overflow from adding lots of large numbers must mean LOTS of numbers that are already near overflow. Doing it in log space is not easy and would reduce precision. It'd be easier to scale them down by, say, 1000 into a more reasonable range; and if the total will be divided by some factor anyway, it's even easier to divide them by that factor before adding (rearranging the order of operations). Still, maybe one to fix in the future.
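A sketch of that prescaling, assuming the common factor can be folded back in wherever the total is used (as the later comments note, the individual terms here already reach infinity, so on its own this may not be enough):

```c
/* Sketch: scale each propensity down by a common factor before summing so
 * the running total stays in range; the true total is scale * the result. */
static double scaled_total(const double *propensities, int count, double scale) {
    double total = 0.0;
    for (int i = 0; i < count; i++) {
        total += propensities[i] / scale;
    }
    return total;
}
```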
> Cython can raise an exception, and that'd fix it.

Okay, I made a stab at this: da7bcb3, though I am currently getting a weird error when trying to add a new `status` field to the `evolve_result` struct. Cython does not seem to find it, even though it is definitely there:
(arrow) spanglry@spanglry:~/Code/arrow$ make clean compile
USE_CYTHON=1 python setup.py build_ext --inplace
running build_ext
cythoning arrow/arrowhead.pyx to arrow/arrowhead.c
Error compiling Cython file:
------------------------------------------------------------
...
Return None or a tuple (steps, time, events, outcome).
"""
# TODO(jerry): Check the state[] array size?
evolved = obsidian.evolve(&self.info, duration, &state[0], &rates[0])
cdef int status = evolved.status
^
------------------------------------------------------------
arrow/arrowhead.pyx:108:33: Object of type 'evolve_result' has no attribute 'status'
building 'arrow.arrowhead' extension
C compiler: gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC
compile options: '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c'
gcc: arrow/arrowhead.c
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
#error Do not use this file, it is the result of a failed Cython compilation.
^~~~~
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
#error Do not use this file, it is the result of a failed Cython compilation.
^~~~~
error: Command "gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c arrow/arrowhead.c -o build/temp.linux-x86_64-2.7/arrow/arrowhead.o" failed with exit status 1
Makefile:10: recipe for target 'compile' failed
make: *** [compile] Error 1
Any thoughts on that one?
This is on the `status` branch if you want to check it out. I've tried compiling just the obsidian stuff first and then compiling the Cython (trying to "reset" the header state?), but I get the same result. I have no idea why it won't find the new header; very strange. Perhaps it is finding the old header somewhere else on the system?
…status field in evolve_result struct
Just update …
Ha, nice! I thought that was one of the auto-generated ones, good to know. So Cython requires a redeclaration of headers, then? Okay, that worked, so I pulled it into this PR.
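For the record, the reason the field was invisible: Cython never parses the C header itself, only its own `cdef extern` redeclaration, so the struct has to be updated there as well. Roughly (the header name and the note about other fields are placeholders, not the repo's exact declaration):

```cython
# Sketch of the redeclaration Cython needs; it only sees members listed here.
cdef extern from "obsidian.h":
    ctypedef struct evolve_result:
        int status  # the new field, read in the .pyx as evolved.status
        # ... the existing fields (steps, time, events, ...) stay listed too
```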
I was printing out the numbers, and it overflowed about halfway through the 1000+ reactions. That number above is absurdly large, and it is only one of 1000 numbers we are summing; I'm not sure three orders of magnitude would make a difference. As I recall, we did some tests where we changed the rates to something absurdly low and got the same result. Actually, I just tried it again, changing all the rates from 1e3 to 1e-30, and the largest propensity changed but still overflowed. So however it works, just scaling is not sufficient to solve the problem.
LGTM.

Where do these huge propensities come from? Why not scale them all down by 1e-30, preferably at the source? Did you do that in large-initial.json? GitHub won't display a useful diff.

Why does adding large numbers produce a NaN?
> Where do these huge propensities come from? Why not scale them all down by 1e-30, preferably at the source?

The basis of the algorithm relies on an "n choose k" operation, which is essentially a ratio of factorials: https://github.com/CovertLab/arrow/blob/master/arrow/obsidian.c#L16-L25

This is fine with lower counts, but as the reactions become larger (involving more reactants) and the available monomers increase, more terms are piled onto this factorial until it blows up. Scaling is insufficient because it only multiplies by a single constant, whereas the combinatorial term keeps growing with every additional reactant and monomer.

The other factor here is that we are being unrealistic in our stoichiometry. Declaring a reaction that involves hundreds of reactants simultaneously merging into a single complex is chemically implausible, and the fact that it leads to these unreasonably huge exponential numbers is kind of an affirmation of that.
> Why does adding large numbers produce a NaN?

This one I am not sure about, though I have witnessed it happen. I suspect that a single term became infinity and was then added to produce NaN, but I'll take a closer look here.
Okay, found the culprit. One of the reactions (index 392) uses 120 units of a particular monomer at the same time (along with some other monomers). The count in the problem for this monomer is 19460.

Running this in the Python console shows it is not a mistake (a reconstruction of the check is sketched below): once this propensity is equal to infinity, summing over all the propensities ends up as NaN. Why do you get NaN when adding a number to infinity instead of just infinity? That seems like it would make more sense, at least mathematically, but it looks like we would have to take that up with the floating-point implementors.
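A reconstruction of that console check, using `scipy.special.comb` in place of the C choose helper (illustrative values, not the original session):

```python
>>> from scipy.special import comb
>>> comb(19460, 120)          # the n-choose-k term for the 120-unit reactant
inf
>>> comb(19460, 120) + 1e306  # adding finite numbers to infinity stays infinite
inf
```

Addition alone keeps the value at infinity, which is what makes the NaN here puzzling.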
Good work tracking that part down, Ryan. BTW combinations should be efficient to compute in log space.
No, the IEEE floating point standard carefully defines the values and behavior.
Aha, yes! That's what I thought. It turns out, though, that if you multiply infinity by zero you get NaN.
Aha! Naturally the floating-point math is in no position to take a limit.
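In IEEE 754 arithmetic, infinity plus a finite number stays infinity, while infinity times zero is an indeterminate form and comes back as NaN; a quick check:

```python
>>> inf = float('inf')
>>> inf + 1e308
inf
>>> inf * 0.0
nan
```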
We have had a couple of incidents where propensity calculations overflow when grappling with large counts of massive complexes. Previously this would fail silently and show up as negative counts or, in one case, an infinite loop.

I have added some simple logic to detect when the total propensity becomes NaN and abort the simulation. It also prints a helpful error message pointing to the reaction with the largest propensity, so that the client can refactor the stoichiometry for that reaction into something more reasonable/tractable/computable in 64-bit floating-point registers.
Example output:
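The message follows the shape of the printf in the diff above; with illustrative values (not an actual run), a line would look roughly like:

```
largest reaction is 392 at inf
```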