
Detecting when propensities overflow #43

Merged · 6 commits · Mar 6, 2020
Conversation

prismofeverything (Member)

We have had a couple of incidents where propensity calculations overflow when grappling with large counts of massive complexes. Previously this would fail silently and manifest as negative counts or, in one case, an infinite loop.

I have added some simple logic to detect when the total propensity becomes NaN and abort the simulation. It also prints out a helpful error message pointing to the reaction with the largest propensity, so that the client can refactor that reaction's stoichiometry into something more reasonable/tractable/computable in 64-bit floating point.

Example output:

failed simulation: total propensity is NAN
largest reaction is 390 at 65750108189886363191576674736750174090067616495572572584950851546902195800567499629394079127738955399755607759040963476860612556881920.000000

arrow/obsidian.c (outdated)
}
}
printf("largest reaction is %d at %f\n", max_reaction, propensities[max_reaction]);
total = 0.0;
Contributor

It could be more helpful for debugging if this returned the NaN [note the capitalization] so the caller or the Python wrapper could check for it and raise an exception instead of printing to stdout without context. Also, dropping the NaN loses the information about which kind of NaN it was.

BTW it might be possible to avoid the overflow sometimes by using extended precision long double math and/or rearranging the order of operations to interleave multiplication and division.

Member Author

Hey @1fish2, thanks for this! It is a good question how best to handle this kind of failure.

It could be more helpful for debugging if this returned the NaN

The NaN arises in an intermediate value (total), which is not returned but instead guides the rest of the computation. StochasticSystem returns a dictionary of five values:

  • steps
  • time
  • events
  • occurrences
  • outcome

The current consideration is that the failure could have happened many steps into the simulation, which did produce reasonable results up until that failure. Do we throw those results away? An argument could be made for that; for now I am returning the useful parts of the results in case they are of value to the client.

The main concern you raise that needs to be solved is that we want to communicate the failure back to the client in a way they can respond to programmatically, rather than the current "triage" message, which is more of a suggestion to fix the offending reaction and run it again. For this we could add another key to the result, perhaps status, that describes the outcome, kind of like a bash exit code (0 for success, other numbers for failures). It would take more work to thread that result back through the C/Cython machinery, but could be worth it if there were some meaningful action the client could take during the run of the simulation. Maybe just aborting would be most correct, but we could leave that up to the client.
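On the Python side I imagine something like this (a hypothetical sketch; neither the status key nor this exact evolve signature exists yet):

result = system.evolve(duration, state, rates)

# status would act like a bash exit code: 0 for success, nonzero for failure
if result.get('status', 0) != 0:
    raise RuntimeError(
        'simulation aborted after {} steps: total propensity overflowed'.format(
            result['steps']))

steps, time, events = result['steps'], result['time'], result['events']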

Currently I don't see demand for that feature (we are going to be using it to identify the offending reaction(s) and decompose the stoichiometry into subreactions), but for completeness'/correctness' sake it makes sense to provide.

Member Author

BTW it might be able to avoid the overflow sometimes by using extended precision long double math and/or rearranging the order of operations to interleave multiplication and division.

We are currently using float64 for the propensity calculations. I don't see a higher-precision data type there, besides complex128, which just uses two float64 values for its two components: https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Are you referring to the 80-bit extended double precision? https://en.wikipedia.org/wiki/Extended_precision

I saw some references to 128-bit quadruple precision values, but they seem to be compiler dependent.
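A quick way to see the difference in range (np.longdouble maps to whatever the compiler's long double is, typically the 80-bit extended type on x86; just an illustration):

import numpy as np

print(np.finfo(np.float64).max)     # ~1.8e308, where the propensities overflow
print(np.finfo(np.longdouble).max)  # ~1.2e4932 on x86, so quite a bit more headroom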

and/or rearranging the order of operations to interleave multiplication and division.

This overflow is happening in a loop while adding together thousands of large numbers. @tahorst suggested projecting to log space, which is an interesting approach to look into.
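As a sketch of what the log space idea looks like (using scipy here just for illustration, not the actual C implementation):

import numpy as np
from scipy.special import logsumexp

# keep log-propensities so the huge values are never materialized
log_propensities = np.array([700.0, 705.0, 710.0])    # np.exp(710) would overflow float64

log_total = logsumexp(log_propensities)               # stable log of the sum
probabilities = np.exp(log_propensities - log_total)  # well-scaled weights for picking a reaction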

Contributor

Indeed, maybe this is low priority. In that case let's file an Issue for later. If this printf message were buried in the logs of 100 sims, I'd be lucky to notice it and would have no clue how to track down what went wrong where, or which output values and downstream generations were GIGO. Cython can raise an exception, and that would fix it.

I'm under the impression that long double in C is an 80-bit extended precision float.

An overflow from adding lots of large numbers must mean LOTS of numbers that are already near overflow. Doing it in log space is not easy and would reduce precision. It'd be easier to scale them down by, say, 1000 into a more reasonable range; and if the total will be divided by some factor anyway, it's even easier to divide each term by that factor before adding (rearranging the order of operations). Still, maybe something to fix in the future.
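For example (illustrative numbers only):

terms = [1e308] * 10          # each term is near the float64 maximum

sum(terms)                    # -> inf: adding first overflows
sum(t / 10.0 for t in terms)  # -> 1e308: dividing each term first stays finite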

Member Author

Cython can raise an exception and that'd fix it.

Okay, I took a stab at this: da7bcb3. Though I am currently getting a weird error when trying to add a new status field to the evolve_result struct: Cython does not seem to find it, even though it is definitely there:

(arrow) spanglry@spanglry:~/Code/arrow$ make clean compile
USE_CYTHON=1 python setup.py build_ext --inplace
running build_ext
cythoning arrow/arrowhead.pyx to arrow/arrowhead.c

Error compiling Cython file:
------------------------------------------------------------
...
        Return None or a tuple (steps, time, events, outcome).
        """
        # TODO(jerry): Check the state[] array size?

        evolved = obsidian.evolve(&self.info, duration, &state[0], &rates[0])
        cdef int status = evolved.status
                                ^
------------------------------------------------------------

arrow/arrowhead.pyx:108:33: Object of type 'evolve_result' has no attribute 'status'
building 'arrow.arrowhead' extension
C compiler: gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC

compile options: '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c'
gcc: arrow/arrowhead.c
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
error: Command "gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c arrow/arrowhead.c -o build/temp.linux-x86_64-2.7/arrow/arrowhead.o" failed with exit status 1
Makefile:10: recipe for target 'compile' failed
make: *** [compile] Error 1

Any thoughts on that one?

Member Author

This is on the status branch if you want to check it out. I've tried compiling just the obsidian stuff first, then compiling the Cython (trying to "reset" the header state?), but I get the same result. I have no idea why it won't find the new header. Very strange.

Perhaps it is finding the old header somewhere else on the system?

@1fish2 (Contributor) commented Mar 5, 2020

Just update arrow/obsidian.pxd.

@prismofeverything (Member Author)

Just update arrow/obsidian.pxd.

Ha, nice! I thought that was one of the auto-generated files, good to know. So Cython requires a redeclaration of the headers then?

Okay, that worked so I pulled it into this PR.
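For reference, the change amounts to re-declaring the struct field in the .pxd, roughly like this (the header name and the elided fields here are placeholders; only status is the new part):

# arrow/obsidian.pxd -- Cython only sees C declarations that are repeated here
cdef extern from "obsidian.h":
    ctypedef struct evolve_result:
        int status    # the new field, mirroring the addition in obsidian.c
        # ...the existing fields remain declared here as before...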

@prismofeverything (Member Author)

An overflow from adding lots of large numbers must mean LOTS of numbers that're already near overflow. Doing it in log space is not easy and would reduce precision. It'd be easier to scale them down by say 1000 into a more reasonable range;

I was printing out the numbers and it overflowed about halfway through the 1000+ reactions. That number above is absurdly large, and it is only one of 1000 numbers we are summing. I'm not sure three orders of magnitude would make a difference. As I recall we did some tests where we changed the rates to something absurdly low and got the same result.


Actually I just tried it again, changing all the rates from 1e3 to 1e-30, and the largest propensity changed from

65750108189886363191576674736750174090067616495572572584950851546902195800567499629394079127738955399755607759040963476860612556881920.000000

to

65750108189886357227963482866563683838097223704780262071616035300317153418688979065540010070533021368320.000000

and it still overflowed. So whatever the mechanism, scaling alone is not sufficient to solve the problem.

@1fish2 (Contributor) left a comment

LGTM.

Where do these huge propensities come from? Why not scale them all down by 1e-30, preferably at the source? Did you do that in large-initial.json? GitHub won't display a useful diff.

@1fish2 (Contributor) commented Mar 5, 2020

Why does adding large numbers produce a NaN instead of an infinity?

@prismofeverything (Member Author)

Where do these huge propensities come from?

The algorithm relies on an "n choose k" operation, which is essentially a ratio of factorials: https://github.com/CovertLab/arrow/blob/master/arrow/obsidian.c#L16-L25

This is fine with lower counts, but as the reactions become larger (involving more reactants) and the available monomers increase, more terms are piled onto this factorial until it blows up. Scaling is insufficient because it amounts to multiplying by a single number, whereas the choose operation multiplies by many large numbers over and over again.
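To illustrate with made-up numbers (not the actual reaction, just the shape of the problem):

from scipy.special import comb

n = 20000                 # hypothetical available monomer count
comb(n, 50)               # huge but still finite (on the order of 1e150)
comb(n, 150)              # inf: ~150 large factors overflow float64 on their own
1e-30 * comb(n, 150)      # still inf: scaling by one small number can't undo it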

The other factor here is that we are being unrealistic in our stoichiometry. Declaring a reaction that involves hundreds of reactants simultaneously merging into a single complex is chemically implausible, and the fact that it leads to these unreasonably huge exponential numbers is kind of an affirmation of that.

Why does adding large numbers produce a NaN instead of an infinity?

This one I am not sure about, though I have witnessed it happen. I suspect that a single term became infinity and was then added in a way that produced NaN, but I'll take a closer look.

@prismofeverything (Member Author)

Why does adding large numbers produce a NaN instead of an infinity?

Okay, found the culprit. One of the reactions (index 392) uses 120 units of a particular monomer at the same time (along with some other monomers). The count for this monomer in the problem is 19460.

choose(19460, 120) ---> inf

Here is what happens when I run this in the python console:

In [3]: from scipy.special import comb

In [4]: comb(19460, 120)
Out[4]: inf

So, not a mistake. Once this propensity is infinity, summing over all propensities ends up as NaN. Why do you get NaN when adding a number to infinity instead of just infinity? Infinity seems like it would make more sense, at least mathematically, but it looks like we would have to take that up with the floating point implementors.

Ryan Spangler added 2 commits March 5, 2020 14:14
@1fish2 (Contributor) commented Mar 5, 2020

Good work tracking that part down, Ryan. BTW combinations should be efficient to compute in log space.
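E.g. via the log-gamma function (a sketch, not the repo's code):

from scipy.special import gammaln

def log_comb(n, k):
    # log of "n choose k" without ever forming the huge intermediate value
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

log_comb(19460, 120)    # ~727, easily representable even though exp() of it is not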

Why do you get NaN when adding a number to infinity instead of just infinity? That seems like it would make more sense, at least mathematically, but it looks like we would have to take that up with the floating point implementors.

No, the IEEE floating point standard carefully defines the values and behavior.

>>> i = float("inf"); i
inf
>>> i + 1
inf
>>> 1 + i
inf
>>> 2 * i
inf
>>> 1e10 * i
inf
>>> i * 1e30
inf
>>> sum(i for _ in range(10000))
inf
>>> i/i
nan

@prismofeverything (Member Author)

No, the IEEE floating point standard carefully defines the values and behavior.

Aha, yes! That's what I thought. It turns out, though, that if you multiply inf by 0 (which is what's happening in this case), you get NaN:

In [1]: float('inf') * 0
Out[1]: nan

@1fish2 (Contributor) commented Mar 6, 2020

Aha! Naturally the floating point math is in no position to take a limit.

@prismofeverything merged commit ba4bf0c into master Mar 6, 2020
@1fish2 deleted the error branch December 10, 2024