Simplify the TimeIndependentMDCObjectiveFunction class #515
Conversation
I encourage anyone who's reading this comment to skip to the Conclusion heading.

Investigating possible changes in numerical behavior

During the 12/17 dev meeting Erik raised the question of how the numerics might change as a result of focusing on "terms" and computing "lsvec" as a function of terms.

Changes to TimeIndependentMDCObjectiveFunction.dterms

The screenshot-of-screenshots below shows the new code and old code.

Changes to TimeIndependentMDCObjectiveFunction.dlsvec

This one is trickier to explain. Let's start with the new code and old code, and dig into the default implementation first. The old code computes dlsvec as

dprobs * (raw_dterms * pt5_over_raw_lsvec)[:, None]

while the new code computes dlsvec as

(dprobs * raw_dterms[:, None]) * pt5_over_raw_lsvec[:, None]

The only differences that could arise here come from the non-associativity of floating point multiplication. Such differences can certainly happen, but there is no fundamental reason to prefer the current ordering over the new one (see the short sketch appended after the Conclusion below).

Overall conclusion: no cause for worry when we end up calling RawObjectiveFunction.dlsvec. So ... what about RawChi2Function.dlsvec? The implementation there is nontrivial, unfortunately. It relies on two private helper functions.

Changes to TimeIndependentMDCObjectiveFunction.terms

There are no numerical changes here.

Changes to TimeIndependentMDCObjectiveFunction.lsvec

Here's the new and old code, along with the default implementations of lsvec and terms. Based on this, the new and old outputs of lsvec can differ only by a choice of sign. Now let's return to why there's the stupid-looking override shown below.

Here is the problem: there are unit tests written in a way that does not respect the freedom of sign choice for lsvec. I tried to adapt the tests to acknowledge the sign freedom a few weeks ago, but I got nowhere. It was much simpler to bite the bullet in the new lsvec implementation and have it get signs from raw_objfn.lsvec by default. This has two downsides, both of which are minor.
```python
def lsvec(self, paramvec=None, oob_check=False):
    return TimeIndependentMDCObjective.lsvec(self, paramvec, oob_check, False)
```

Conclusion

These changes should not be problematic from a numerical perspective. I've taken extra steps to ensure minimal outward behavior change (i.e., keeping the possibility that TimeIndependentMDCStore.lsvec returns a vector with negative components). On the off chance that we do see an uptick in numerical issues with GST model fitting, we have this detailed investigation to refer back to.

@sserita, @coreyostrove, please put me out of my misery.
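As promised above, here is a minimal NumPy sketch of the two dlsvec multiplication orderings. It is illustrative only: the array names mirror the expressions quoted above, but the shapes and values are made up and this is not pyGSTi code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_elements, n_params = 50, 12

# Stand-ins for the quantities in the expressions above (values are made up).
dprobs = rng.standard_normal((n_elements, n_params))
raw_dterms = rng.standard_normal(n_elements)
raw_lsvec = np.abs(rng.standard_normal(n_elements)) + 1e-3
pt5_over_raw_lsvec = 0.5 / raw_lsvec

# Old ordering: scale the per-element vector first, then broadcast against dprobs.
dlsvec_old = dprobs * (raw_dterms * pt5_over_raw_lsvec)[:, None]

# New ordering: broadcast raw_dterms against dprobs first, then scale.
dlsvec_new = (dprobs * raw_dterms[:, None]) * pt5_over_raw_lsvec[:, None]

# The two results agree up to floating point rounding (non-associativity):
# a few ulps of the entries, not exact bit equality in general.
print(np.max(np.abs(dlsvec_old - dlsvec_new)))
assert np.allclose(dlsvec_old, dlsvec_new, rtol=1e-12, atol=0.0)
```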
…me a file in test/performance so that it doesn't get discovered by testing frameworks like pytest
…t commit will simplify further, which may have consequences from the perspective of floating point arithmetic. Checking in NOW in case we want to revert to this version for numerical reasons.
Thanks for the huge effort @rileyjmurray, and the very detailed post explaining the differences. That helped a lot.
I tried to have maximal mercy but I couldn't stop myself from having two minor comment-related requests - add those in and it'll be a quick approve from me.
```diff
      allowed_perr=0.1, max_paths_per_outcome=1000,
      perr_heuristic='meanscaled', max_term_stages=5)
-     target_model.from_vector(1e-10 * np.ones(target_model.num_params))  # to seed term calc (starting with perfect zeros causes trouble)
+     target_model.from_vector(1e-9 * np.ones(target_model.num_params))  # to seed term calc (starting with perfect zeros causes trouble)
```
Why this change?
Uhhh .... if I didn't make it, then the test failed nastily. The optimizer just didn't converge.
I can compare side by side and see what the threshold is for the old code vs new code.
@sserita, I'm very glad you asked this question. It turns out that the non-convergence I observed stemmed from a bug in the LM optimizer. Specifically, it stemmed from how the LM optimizer decided to update the "best_x."
The implementations on develop (i.e., both the default SimplerLMOptimizer and the backup CustomLMOptimizer) would only update best_x if norm_f < min_norm_f and a boolean flag called new_x_is_known_in_bounds was True. But new_x_is_known_in_bounds is really more of a sufficient condition for being in bounds than a necessary condition. It makes more sense to do this:
If norm_f < min_norm_f, then check if new_x_is_known_in_bounds == True. If it is, then we take this flag's word for it and update best_x[:] = x. Otherwise, we explicitly check whether x is in bounds and we update best_x[:] = x if and only if that's the case. (See the sketch below.)
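Here is a minimal sketch of that accept/update step, under stated assumptions: the names norm_f, min_norm_f, new_x_is_known_in_bounds, and best_x follow the prose above, but the surrounding LM loop is elided and the in_bounds helper is hypothetical, not pyGSTi's actual API.

```python
import numpy as np

def in_bounds(x, lower, upper):
    """Hypothetical helper: explicit elementwise bounds check."""
    return bool(np.all(x >= lower) and np.all(x <= upper))

def maybe_update_best(x, norm_f, min_norm_f, new_x_is_known_in_bounds,
                      best_x, lower, upper):
    """One accept/update step, as described in the prose above.

    Returns the (possibly updated) running minimum of norm_f; best_x is
    modified in place when the candidate is accepted.
    """
    if norm_f < min_norm_f:
        # The flag is a *sufficient* condition for x being in bounds, so trust
        # it when it's True; otherwise fall back to an explicit check instead
        # of silently refusing to update best_x.
        if new_x_is_known_in_bounds or in_bounds(x, lower, upper):
            best_x[:] = x
            min_norm_f = norm_f
    return min_norm_f
```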
If I make that change then the test passes without making the change that prompted your comment.
See below for my investigation, written to a future pyGSTi developer who might come across this.
Investigation
Stefan's comment concerned a change I made in test/test_packages/drivers/test_calcmethods1Q.py::test_stdgst_prunedpath. The change was to replace a scalar that was hard-coded to 1e-10 with a scalar hard-coded to 1e-9. Let's call that scalar perturb. So the old value is perturb=1e-10.
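For context, the line under discussion has the shape shown below. This is a paraphrase with the hard-coded scalar factored out as perturb; the ToyModel class is a hypothetical stand-in so the snippet runs without pyGSTi.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for the pyGSTi model used in the test."""
    def __init__(self, num_params):
        self.num_params = num_params
        self._v = np.zeros(num_params)

    def from_vector(self, v):
        self._v = np.asarray(v, dtype=float).copy()

# 24 matches the "(approx) model params" count reported in the logs below.
target_model = ToyModel(num_params=24)

# The seeding line from the test, with the scalar factored out:
# perturb = 1e-10 on develop, 1e-9 in the commit Stefan commented on.
perturb = 1e-10
target_model.from_vector(perturb * np.ones(target_model.num_params))  # seed term calc
```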
I ran this test on develop with different values of perturb; those runs passed when perturb >= 5e-13 and failed when perturb <= 2.5e-13. Then I ran this test on the current branch using the commit at the time that Stefan left his comment. Those runs passed when perturb >= 2e-10 and failed when perturb <= 1.125e-10.
I started to write a report of these observations when I noticed something funny. In the test log it was clear that the LM optimizer was terminating outer iterations with a much larger value for norm_f than the minimum value seen across the outer iterations (specific logs are shown below). This made me suspect that best_x wasn't being set properly. Sure enough, I found this issue where a sufficient condition for an iterate being in-bounds (new_x_is_known_in_bounds) was effectively used as a necessary condition. After updating the termination criteria I was able to get this problematic unit test to pass when perturb==1e-10. In fact, it passed with perturb==1e-12. I didn't bother checking smaller values.
develop, passing with 5e-13
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=2.44949e-12, |J|=2835.48
--- Outer Iter 1: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=2835.51
--- Outer Iter 2: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3040.87
--- Outer Iter 3: norm_f = 95.8116, mu=168.158, |x|=0.276211, |J|=3017.42
--- Outer Iter 4: norm_f = 31.8847, mu=56.0528, |x|=0.268233, |J|=3017.41
--- Outer Iter 5: norm_f = 31.864, mu=18.6843, |x|=0.268133, |J|=3017.33
--- Outer Iter 6: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=3017.33
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332207 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=2835.51
--- Outer Iter 1: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3040.87
--- Outer Iter 2: norm_f = 95.8116, mu=168.158, |x|=0.276211, |J|=3017.42
--- Outer Iter 3: norm_f = 31.8847, mu=56.0528, |x|=0.268233, |J|=3017.41
--- Outer Iter 4: norm_f = 31.864, mu=18.6843, |x|=0.268133, |J|=3017.33
--- Outer Iter 5: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3017.33
--- Outer Iter 6: norm_f = 607.403, mu=1.52598e+07, |x|=0.255282, |J|=3040.98
--- Outer Iter 7: norm_f = 591.59, mu=5.08659e+06, |x|=0.255303, |J|=3040.85
--- Outer Iter 8: norm_f = 548.357, mu=1.69553e+06, |x|=0.255395, |J|=3040.46
--- Outer Iter 9: norm_f = 448.118, mu=565177, |x|=0.255827, |J|=3039.43
--- Outer Iter 10: norm_f = 289.143, mu=188392, |x|=0.257372, |J|=3037.08
--- Outer Iter 11: norm_f = 150.619, mu=62797.4, |x|=0.260537, |J|=3032.8
--- Outer Iter 12: norm_f = 68.2405, mu=20932.5, |x|=0.26454, |J|=3026.4
--- Outer Iter 13: norm_f = 36.1194, mu=6977.49, |x|=0.267382, |J|=3020.41
--- Outer Iter 14: norm_f = 31.9729, mu=2325.83, |x|=0.268035, |J|=3017.84
--- Outer Iter 15: norm_f = 31.8644, mu=775.277, |x|=0.268122, |J|=3017.37
--- Outer Iter 16: norm_f = 31.864, mu=258.426, |x|=0.268135, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.7s
Final optimization took 0.7s
develop, passing with 2e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=9.79796e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
--- Outer Iter 2: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 3: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 4: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 5: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 6: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 7: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 8: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 9: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 10: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=3017.33
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332181 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 2: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 3: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 4: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 5: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 6: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 7: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 8: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 9: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3017.33
--- Outer Iter 10: norm_f = 24207.2, mu=111502, |x|=0.0993976, |J|=3245.77
--- Outer Iter 11: norm_f = 1743.06, mu=37167.3, |x|=0.216206, |J|=3057.74
--- Outer Iter 12: norm_f = 177.863, mu=12389.1, |x|=0.26331, |J|=3032.07
--- Outer Iter 13: norm_f = 47.4228, mu=4129.7, |x|=0.269905, |J|=3020.44
--- Outer Iter 14: norm_f = 31.995, mu=1376.57, |x|=0.268271, |J|=3017.63
--- Outer Iter 15: norm_f = 31.8641, mu=458.855, |x|=0.268137, |J|=3017.34
--- Outer Iter 16: norm_f = 31.864, mu=137.657, |x|=0.268136, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
This PR, passing with 2e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=9.79796e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
--- Outer Iter 2: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 3: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 4: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 5: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 6: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 7: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 8: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 9: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 10: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=3017.33
--- Outer Iter 11: norm_f = 166090, mu=293.999, |x|=0.0032486, |J|=2851.52
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332180 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166090, mu=293.999, |x|=0.0032486, |J|=2851.52
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 696.596, mu=100352, |x|=0.273829, |J|=3033.45
--- Outer Iter 2: norm_f = 145.292, mu=33450.5, |x|=0.26568, |J|=3028.08
--- Outer Iter 3: norm_f = 50.8609, mu=11150.2, |x|=0.267615, |J|=3021.55
--- Outer Iter 4: norm_f = 33.0403, mu=3716.73, |x|=0.268405, |J|=3017.96
--- Outer Iter 5: norm_f = 31.8829, mu=1238.91, |x|=0.268202, |J|=3017.31
--- Outer Iter 6: norm_f = 31.8642, mu=412.97, |x|=0.268145, |J|=3017.31
--- Outer Iter 7: norm_f = 31.864, mu=123.891, |x|=0.268137, |J|=3017.33
--- Outer Iter 8: norm_f = 696.596, mu=100352, |x|=0.273829, |J|=3017.33
--- Outer Iter 9: norm_f = 696.594, mu=30105.5, |x|=0.273829, |J|=3033.45
--- Outer Iter 10: norm_f = 80.7071, mu=10035.2, |x|=0.269167, |J|=3023.24
--- Outer Iter 11: norm_f = 34.0435, mu=3345.05, |x|=0.268725, |J|=3018.22
--- Outer Iter 12: norm_f = 31.886, mu=1115.02, |x|=0.268213, |J|=3017.33
--- Outer Iter 13: norm_f = 31.8642, mu=371.673, |x|=0.268145, |J|=3017.32
--- Outer Iter 14: norm_f = 31.864, mu=111.502, |x|=0.268137, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
This PR, failing with 1e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=4.89898e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166100, mu=979.996, |x|=0.00324622, |J|=2839.6
--- Outer Iter 2: norm_f = 52952, mu=334505, |x|=0.0493353, |J|=3716.22
--- Outer Iter 3: norm_f = 8281.27, mu=203584, |x|=0.159582, |J|=3101.88
--- Outer Iter 4: norm_f = 631.08, mu=67861.3, |x|=0.234876, |J|=3044.93
--- Outer Iter 5: norm_f = 97.1942, mu=22620.4, |x|=0.260353, |J|=3031.15
--- Outer Iter 6: norm_f = 40.7621, mu=7540.14, |x|=0.266602, |J|=3022.47
--- Outer Iter 7: norm_f = 32.2466, mu=2513.38, |x|=0.26786, |J|=3018.4
--- Outer Iter 8: norm_f = 31.8699, mu=837.794, |x|=0.268084, |J|=3017.45
--- Outer Iter 9: norm_f = 31.864, mu=279.265, |x|=0.26813, |J|=3017.34
--- Outer Iter 10: norm_f = 31.864, mu=83.7794, |x|=0.268136, |J|=3017.33
--- Outer Iter 11: norm_f = 166100, mu=979.996, |x|=0.00324622, |J|=3017.33
--- Outer Iter 12: norm_f = 166100, mu=293.999, |x|=0.00324625, |J|=2839.55
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332201 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166100, mu=293.999, |x|=0.00324625, |J|=2839.55
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 8859.56, mu=100352, |x|=0.157859, |J|=3102.33
--- Outer Iter 2: norm_f = 392.573, mu=33450.5, |x|=0.243907, |J|=3037.75
--- Outer Iter 3: norm_f = 56.3503, mu=11150.2, |x|=0.265337, |J|=3024.63
--- Outer Iter 4: norm_f = 33.251, mu=3716.73, |x|=0.268011, |J|=3018.92
--- Outer Iter 5: norm_f = 31.8752, mu=1238.91, |x|=0.268108, |J|=3017.49
--- Outer Iter 6: norm_f = 31.864, mu=412.97, |x|=0.268133, |J|=3017.34
--- Outer Iter 7: norm_f = 8859.56, mu=100352, |x|=0.157859, |J|=3017.34
Least squares message = Relative change, |dx|/|x|, is at most 1e-08
2*Delta(log(L)) = 17719.1 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
-- Performing 'go0' gauge optimization on GateSetTomography estimate --
MISFIT nSigma = 1928.7314162397222
Fantastic work tracking this down, @rileyjmurray!
Thanks, @coreyostrove! I feel pretty good about this one. Who knows how many past calls to LM suffered from this bug. It's wild to think about, since LM was already performing very well, and now it'll perform even better :)
Agreed, nice work and as always, the detailed writeups are incredibly appreciated!
…rease 1e-10 to 2e-10 instead of increasing 1e-10 to 1e-9).
@sserita once tests pass we should be good to merge :)
Thanks as always for the great work, and tracking down that bug in the LM code!
While playing around with MPI parallelization across parameters (instead of circuits, which is the default for the map simulator as of 0.9.13) I ran into a bug that I tracked down to a shared-memory issue inadvertently introduced with PR #515. Attached is a script which, if run using MPI (I was using 4 ranks), runs a single iteration of two-qubit GST on a single thread and on all four threads and logs the results for comparison: [debug_dist_objective_fn.py](https://github.com/user-attachments/files/22715098/debug_dist_objective_fn.py). Running this script on 0.9.14.1 yields the following output:

```
Number of Ranks is 4

Fit Log Single Rank
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (1,)
MapLayout: 1 processors divided into 1 x 1 (= 1) grid along circuit and parameter directions.
1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (1 cores) ***
More atom-processors than hosts: each host gets ~1 atom-processors
Atom-processors already occupy a single node, dividing atom-processor into 1 param-processors.
*** Divided 1-host atom-processor (~1 procs) into 1 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
--- chi2 GST ---
--- Outer Iter 0: norm_f = 4.5053e+09, mu=1, |x|=0, |J|=3.86694e+08
  - Inner Loop: mu=1.542e+12, norm_dx=4.69768e-06
    (cont): norm_new_f=1289.29, dL=4.42172e+09, dF=4.5053e+09, reldL=0.981449, reldF=1
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=1.0189  mu * 0.333333 => 5.14e+11
Least squares message = Maximum iterations (1) exceeded
_objfn = 1289.29 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 1.4s
Iteration 1 took 1.5s

Last iteration:
--- logl GST ---
--- Outer Iter 0: norm_f = 644.825, mu=1, |x|=0.00216741, |J|=556063
  - Inner Loop: mu=3.08599e+06, norm_dx=8.62909e-06
    (cont): norm_new_f=549.175, dL=95.681, dF=95.65, reldL=0.148383, reldF=0.148335
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=0.999676  mu * 0.333333 => 1.02866e+06
Least squares message = Maximum iterations (1) exceeded
_objfn = 1098.35 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 1.3s
Final optimization took 1.3s

Fit Log Four Ranks
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (4,)
MapLayout: 4 processors divided into 1 x 4 (= 4) grid along circuit and parameter directions.
1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (4 cores) ***
*** Using shared memory to communicate between the 4 cores on each of 1 hosts ***
More atom-processors than hosts: each host gets ~1 atom-processors
Atom-processors already occupy a single node, dividing atom-processor into 4 param-processors.
*** Divided 1-host atom-processor (~4 procs) into 4 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
--- chi2 GST ---
--- Outer Iter 0: norm_f = 6381.58, mu=1, |x|=0, |J|=3.66935e+11
  - Inner Loop: mu=1.44141e+18, norm_dx=4.27565e-18
    (cont): norm_new_f=6177.24, dL=4871.93, dF=204.333, reldL=0.763437, reldF=0.0320192
    Accepted! gain ratio=0.0419408  mu * 0.3 => 4.32423e+17
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
Least squares message = Maximum iterations (1) exceeded
_objfn = 6177.24 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 0.5s
Iteration 1 took 0.6s

Last iteration:
--- logl GST ---
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
--- Outer Iter 0: norm_f = 2617.65, mu=1, |x|=2.06777e-09, |J|=6.23454e+11
  - Inner Loop: mu=4.05554e+18, norm_dx=3.83592e-19
    (cont): norm_new_f=2605.54, dL=1399.13, dF=12.1112, reldL=0.534499, reldF=0.00462674
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=0.00865623  mu * 0.3 => 1.21666e+18
Least squares message = Maximum iterations (1) exceeded
_objfn = 5211.07 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 0.5s
Final optimization took 0.5s
```

Obviously the final objective function values are quite different, but what caught my eye is that this was true even at the very initial step, before the optimizer had moved at all:

Single threaded: norm_f = 4.5053e+09
Multi-threaded: norm_f = 6381.58

Shared memory is a real mind-bender, so I'll spare folks the details of how I tracked this down, but ultimately I traced it to [this line](https://github.com/sandialabs/pyGSTi/blob/339c8a3df6e84c9a154e04d42e090ab9d27f78b2/pygsti/objectivefns/objectivefns.py#L4415). The bug arises because in the multi-threaded setting `lsvec` is an instance of `LocalNumpyArray`, which generally points to an array that lives in shared memory. The square root on that line was not guarded, and since each rank was pointing to the same memory, the square root could be applied as many as `num_ranks` times (in my testing this was actually nondeterministic, usually 3 or 4 times, probably due to a race condition). I pattern matched off of other parts of `objectivefns.py` where shared memory is used (mainly `terms` in `TimeIndependentMDCObjectiveFunction`) and think I got the correct guards added in (the patch seemed to work in my testing). I also added back in a few comments related to the use of shared memory in this part of the code that were removed in #515, which I found useful in debugging this.

P.S. As for why this wasn't caught sooner: the default forward simulator since 0.9.13 is `MapForwardSimulator`, and the default parallelization strategy for map is across circuits. It happens that when parallelizing only across circuits the portions of the shared memory indexed into by each rank are non-overlapping, but that isn't true when parallelizing across parameters. So you'd have had to manually change the parallelization strategy to see this behavior, which most users don't do unless they have a specific reason to.

Co-authored-by: Corey Ostrove <ciostro@sandia.gov>
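For readers unfamiliar with this failure mode, here is a minimal mpi4py sketch of the general pattern: an unguarded in-place update of a node-shared array gets applied once per rank. It is not pyGSTi's actual code; LocalNumpyArray and the real guard logic live in pyGSTi, and the window setup and rank-0 guard below are illustrative assumptions.

```python
# Run with e.g.:  mpiexec -n 4 python shared_sqrt_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 8

# One shared buffer per node; every rank maps the *same* memory
# (an illustrative stand-in for an array backing a LocalNumpyArray).
itemsize = MPI.DOUBLE.Get_size()
win = MPI.Win.Allocate_shared(n * itemsize if rank == 0 else 0, itemsize, comm=comm)
buf, _ = win.Shared_query(0)
shared = np.ndarray(buffer=buf, dtype='d', shape=(n,))

if rank == 0:
    shared[:] = 4.0          # pretend these are the squared residuals ("terms")
comm.Barrier()

# BUG pattern (what an unguarded in-place sqrt does): every rank applies the
# square root to the same memory, so it can run up to comm.size times and the
# result depends on timing:
#   shared[:] = np.sqrt(shared)

# One possible guard: let a single rank perform the in-place update.
if rank == 0:
    shared[:] = np.sqrt(shared)
comm.Barrier()

if rank == 0:
    print(shared)            # [2. 2. 2. 2. 2. 2. 2. 2.]

win.Free()
```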
Note: For some reason the automatic merge from bugfix into develop failed on PR #660, so this PR is just manually performing that merge.
TimeIndependentMDCObjectiveFunction is one of the most complicated classes I've encountered in pyGSTi. Right now ...
This PR simplifies TimeIndependentMDCObjectiveFunction so that
These changes should materially improve the maintainability of TimeIndependentMDCObjectiveFunction, as well as that of its 8 subclasses, which collectively span ~500 lines. The outward behavior of this class should be the same as before, up to floating point rounding errors.
For the curious, I went down this rabbit hole when working on PR #506.
Incidental changes related to MPI testing: moved some lightweight MPI tests from
test/test_packages to test/unit. This move was prompted by the fact that this PR touches code that involves a lot of MPI. While I was at it, I renamed an MPI performance profiling file so that it didn't get misinterpreted as a test file.