
Detecting when propensities overflow #43

Merged · 6 commits · Mar 6, 2020
Conversation

prismofeverything (Member)

We have had a couple of incidents where propensity calculations overflow when grappling with large counts of massive complexes. Previously this would fail silently and manifest as negative counts or, in one case, an infinite loop.

I have added some simple logic to detect when the total propensity becomes NaN and abort the simulation. It also prints out a helpful error message pointing to the reaction with the largest propensity, so that the client can refactor that reaction's stoichiometry into something more reasonable/tractable/computable in 64-bit floating point.

Example output:

failed simulation: total propensity is NAN
largest reaction is 390 at 65750108189886363191576674736750174090067616495572572584950851546902195800567499629394079127738955399755607759040963476860612556881920.000000

arrow/obsidian.c (outdated)
}
}
printf("largest reaction is %d at %f\n", max_reaction, propensities[max_reaction]);
total = 0.0;
Contributor

It could be more helpful for debugging if this returned the NaN [note the capitalization] so the caller or the Python wrapper could check for it and raise an exception instead of printing to stdout without context. Also, dropping the NaN loses the information about which kind of NaN it was.

BTW it might be possible to avoid the overflow sometimes by using extended precision long double math and/or rearranging the order of operations to interleave multiplication and division.

Member Author

Hey @1fish2, thanks for this! It is a good question how best to handle this kind of failure.

It could be more helpful for debugging if this returned the NaN

The NaN arises in an intermediate value (total), which is not returned but instead guides the rest of the computation. StochasticSystem returns a dictionary of five values:

  • steps
  • time
  • events
  • occurrences
  • outcome

The current consideration is that the failure could have happened many steps into the simulation, which did produce reasonable results up until that failure. Do we throw those results away? An argument could be made for that; for now I am returning the useful parts of the results in case they are of value to the client.

The main concern you raise that needs to be solved is that we want to communicate the failure back to the client in a way they can respond to programmatically, rather than the current "triage" message, which is more of a suggestion to fix the offending reaction and run it again. For this we could add another key to the result, perhaps status, that describes the outcome, kind of like a bash exit code (0 for success, other numbers for failures). It would take more work to thread that result back through the C/Cython machinery, but could be worth it if there were some meaningful action the client could take during the run of the simulation. Maybe just aborting would be most correct, but we could leave that up to the client.
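On the Python side I imagine something like this (a hypothetical sketch; neither the status key nor this exact evolve signature exists yet):

result = system.evolve(duration, state, rates)

# status would act like a bash exit code: 0 for success, nonzero for failure
if result.get('status', 0) != 0:
    raise RuntimeError(
        'simulation aborted after {} steps: total propensity overflowed'.format(
            result['steps']))

steps, time, events = result['steps'], result['time'], result['events']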

Currently I don't see demand for that feature (we are going to be using it to identify the offending reaction(s) and decompose the stoichiometry into subreactions), but for completeness'/correctness' sake it makes sense to provide.

Member Author

BTW it might be able to avoid the overflow sometimes by using extended precision long double math and/or rearranging the order of operations to interleave multiplication and division.

We are currently using float64 for the propensity calculations. I don't see a higher-precision data type there, besides complex128, which just uses two float64 values for its two components: https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Are you referring to the 80-bit extended double precision? https://en.wikipedia.org/wiki/Extended_precision

I saw some references to 128-bit quadruple precision values, but they seem to be compiler dependent.
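A quick way to see the difference in range (np.longdouble maps to whatever the compiler's long double is, typically the 80-bit extended type on x86; just an illustration):

import numpy as np

print(np.finfo(np.float64).max)     # ~1.8e308, where the propensities overflow
print(np.finfo(np.longdouble).max)  # ~1.2e4932 on x86, so quite a bit more headroom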

and/or rearranging the order of operations to interleave multiplication and division.

This overflow is happening in a loop while adding together thousands of large numbers. @tahorst suggested projecting to log space, which is an interesting approach to look into.
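As a sketch of what the log space idea looks like (using scipy here just for illustration, not the actual C implementation):

import numpy as np
from scipy.special import logsumexp

# keep log-propensities so the huge values are never materialized
log_propensities = np.array([700.0, 705.0, 710.0])    # np.exp(710) would overflow float64

log_total = logsumexp(log_propensities)               # stable log of the sum
probabilities = np.exp(log_propensities - log_total)  # well-scaled weights for picking a reaction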

Contributor

Indeed, maybe this is low priority. In that case let's file an Issue for later. If this printf message were buried in the logs of 100 sims, I'd be lucky to notice it and would have no clue how to track down what went wrong where, or which output values and downstream generations were GIGO. Cython can raise an exception, and that would fix it.

I'm under the impression that long double in C is an 80-bit extended precision float.

An overflow from adding lots of large numbers must mean LOTS of numbers that are already near overflow. Doing it in log space is not easy and would reduce precision. It'd be easier to scale them down by, say, 1000 into a more reasonable range; and if the total will be divided by some factor anyway, it's even easier to divide each term by that factor before adding (rearranging the order of operations). Still, maybe something to fix in the future.
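For example (illustrative numbers only):

terms = [1e308] * 10          # each term is near the float64 maximum

sum(terms)                    # -> inf: adding first overflows
sum(t / 10.0 for t in terms)  # -> 1e308: dividing each term first stays finite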

Member Author

Cython can raise an exception and that'd fix it.

Okay, I took a stab at this: da7bcb3. Though I am currently getting a weird error when trying to add a new status field to the evolve_result struct: Cython does not seem to find it, even though it is definitely there:

(arrow) spanglry@spanglry:~/Code/arrow$ make clean compile
USE_CYTHON=1 python setup.py build_ext --inplace
running build_ext
cythoning arrow/arrowhead.pyx to arrow/arrowhead.c

Error compiling Cython file:
------------------------------------------------------------
...
        Return None or a tuple (steps, time, events, outcome).
        """
        # TODO(jerry): Check the state[] array size?

        evolved = obsidian.evolve(&self.info, duration, &state[0], &rates[0])
        cdef int status = evolved.status
                                ^
------------------------------------------------------------

arrow/arrowhead.pyx:108:33: Object of type 'evolve_result' has no attribute 'status'
building 'arrow.arrowhead' extension
C compiler: gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC

compile options: '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c'
gcc: arrow/arrowhead.c
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
arrow/arrowhead.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
error: Command "gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -Iarrow -I/home/spanglry/Code/arrow/arrow -I/home/spanglry/.pyenv/versions/arrow/lib/python2.7/site-packages/numpy/core/include -I/home/spanglry/.pyenv/versions/2.7.15/include/python2.7 -c arrow/arrowhead.c -o build/temp.linux-x86_64-2.7/arrow/arrowhead.o" failed with exit status 1
Makefile:10: recipe for target 'compile' failed
make: *** [compile] Error 1

Any thoughts on that one?

Member Author

This is on the status branch if you want to check it out. I've tried compiling just the obsidian stuff first, then compiling the Cython (trying to "reset" the header state?), but I get the same result. I have no idea why it won't find the new header. Very strange.

Perhaps it is finding the old header somewhere else on the system?

@1fish2 (Contributor) commented Mar 5, 2020

Just update arrow/obsidian.pxd.

@prismofeverything (Member Author)

Just update arrow/obsidian.pxd.

Ha, nice! I thought that was one of the auto-generated files, good to know. So Cython requires a redeclaration of the headers then?

Okay, that worked so I pulled it into this PR.
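For reference, the change amounts to re-declaring the struct field in the .pxd, roughly like this (the header name and the elided fields here are placeholders; only status is the new part):

# arrow/obsidian.pxd -- Cython only sees C declarations that are repeated here
cdef extern from "obsidian.h":
    ctypedef struct evolve_result:
        int status    # the new field, mirroring the addition in obsidian.c
        # ...the existing fields remain declared here as before...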

@prismofeverything (Member Author)

An overflow from adding lots of large numbers must mean LOTS of numbers that're already near overflow. Doing it in log space is not easy and would reduce precision. It'd be easier to scale them down by say 1000 into a more reasonable range;

I was printing out the numbers and it overflowed about halfway through the 1000+ reactions. That number above is absurdly large, and it is only one of 1000 numbers we are summing. I'm not sure three orders of magnitude would make a difference. As I recall we did some tests where we changed the rates to something absurdly low and got the same result.


Actually I just tried it again, changing all the rates from 1e3 to 1e-30, and the largest propensity changed from

65750108189886363191576674736750174090067616495572572584950851546902195800567499629394079127738955399755607759040963476860612556881920.000000

to

65750108189886357227963482866563683838097223704780262071616035300317153418688979065540010070533021368320.000000

and it still overflowed. So whatever the mechanism, scaling alone is not sufficient to solve the problem.

@1fish2 (Contributor) left a comment

LGTM.

Where do these huge propensities come from? Why not scale them all down by 1e-30, preferably at the source? Did you do that in large-initial.json? GitHub won't display a useful diff.

@1fish2 (Contributor) commented Mar 5, 2020

Why does adding large numbers produce a NaN instead of an infinity?

@prismofeverything (Member Author)

Where do these huge propensities come from?

The algorithm relies on an "n choose k" operation, which is essentially a ratio of factorials: https://github.com/CovertLab/arrow/blob/master/arrow/obsidian.c#L16-L25

This is fine with lower counts, but as the reactions become larger (involving more reactants) and the available monomers increase, more terms are piled onto this factorial until it blows up. Scaling is insufficient because it amounts to multiplying by a single number, whereas the choose operation multiplies by many large numbers over and over again.
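To illustrate with made-up numbers (not the actual reaction, just the shape of the problem):

from scipy.special import comb

n = 20000                 # hypothetical available monomer count
comb(n, 50)               # huge but still finite (on the order of 1e150)
comb(n, 150)              # inf: ~150 large factors overflow float64 on their own
1e-30 * comb(n, 150)      # still inf: scaling by one small number can't undo it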

The other factor here is that we are being unrealistic in our stoichiometry. Declaring a reaction that involves hundreds of reactants simultaneously merging into a single complex is chemically implausible, and the fact that it leads to these unreasonably huge exponential numbers is kind of an affirmation of that.

Why does adding large numbers produce a NaN instead of an infinity?

This one I am not sure about, though I have witnessed it happen. I suspect that a single term became infinity and was then added in a way that produced NaN, but I'll take a closer look.

@prismofeverything (Member Author)

Why does adding large numbers produce a NaN instead of an infinity?

Okay, found the culprit. One of the reactions (index 392) uses 120 units of a particular monomer at the same time (along with some other monomers). The count for this monomer in the problem is 19460.

choose(19460, 120) ---> inf

Here is what happens when I run this in the python console:

In [3]: from scipy.special import comb

In [4]: comb(19460, 120)
Out[4]: inf

So, not a mistake. Once this propensity is infinity, summing over all propensities ends up as NaN. Why do you get NaN when adding a number to infinity instead of just infinity? Infinity seems like it would make more sense, at least mathematically, but it looks like we would have to take that up with the floating point implementors.

Ryan Spangler added 2 commits March 5, 2020 14:14
@1fish2 (Contributor) commented Mar 5, 2020

Good work tracking that part down, Ryan. BTW combinations should be efficient to compute in log space.
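E.g. via the log-gamma function (a sketch, not the repo's code):

from scipy.special import gammaln

def log_comb(n, k):
    # log of "n choose k" without ever forming the huge intermediate value
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

log_comb(19460, 120)    # ~727, easily representable even though exp() of it is not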

Why do you get NaN when adding a number to infinity instead of just infinity? That seems like it would make more sense, at least mathematically, but it looks like we would have to take that up with the floating point implementors.

No, the IEEE floating point standard carefully defines the values and behavior.

>>> i = float("inf"); i
inf
>>> i + 1
inf
>>> 1 + i
inf
>>> 2 * i
inf
>>> 1e10 * i
inf
>>> i * 1e30
inf
>>> sum(i for _ in range(10000))
inf
>>> i/i
nan

@prismofeverything (Member Author)

No, the IEEE floating point standard carefully defines the values and behavior.

Aha, yes! That's what I thought. It turns out, though, that if you multiply inf by 0 (which is what's happening in this case), you get NaN:

In [1]: float('inf') * 0
Out[1]: nan

@1fish2 (Contributor) commented Mar 6, 2020

Aha! Naturally the floating point math is in no position to take a limit.

@prismofeverything merged commit ba4bf0c into master Mar 6, 2020
@1fish2 deleted the error branch December 10, 2024