Simplify the TimeIndependentMDCObjectiveFunction class #515
Conversation
I encourage anyone who's reading this comment to skip to the Conclusion heading.

Investigating possible changes in numerical behavior

During the 12/17 dev meeting Erik raised the question of how the numerics might change as a result of focusing on "terms" and computing "lsvec" as a function of terms.

Changes to TimeIndependentMDCObjectiveFunction.dterms

The screenshot-of-screenshots below shows the new code and old code.

Changes to TimeIndependentMDCObjectiveFunction.dlsvec

This one is trickier to explain. Let's start with the new code and old code, and dig into the default implementation first. The old code computes dlsvec as

dprobs * (raw_dterms * pt5_over_raw_lsvec)[:, None]

while the new code computes dlsvec as

(dprobs * raw_dterms[:, None]) * pt5_over_raw_lsvec[:, None]

The only differences that could arise here come from the non-associativity of floating point multiplication. Such differences can certainly happen, but there is no fundamental reason to prefer the current ordering over the new one (see the short sketch appended after the Conclusion below).

Overall conclusion: no cause for worry when we end up calling RawObjectiveFunction.dlsvec. So ... what about RawChi2Function.dlsvec? The implementation there is nontrivial, unfortunately. It relies on two private helper functions.

Changes to TimeIndependentMDCObjectiveFunction.terms

There are no numerical changes here.

Changes to TimeIndependentMDCObjectiveFunction.lsvec

Here's the new and old code, along with the default implementations of lsvec and terms. Based on this, the new and old outputs of lsvec can differ only by a choice of sign. Now let's return to why there's the stupid-looking override shown below.

Here is the problem: there are unit tests written in a way that does not respect the freedom of sign choice for lsvec. I tried to adapt the tests to acknowledge the sign freedom a few weeks ago, but I got nowhere. It was much simpler to bite the bullet in the new lsvec implementation and have it get signs from raw_objfn.lsvec by default. This has two downsides, both of which are minor.
```python
def lsvec(self, paramvec=None, oob_check=False):
    return TimeIndependentMDCObjective.lsvec(self, paramvec, oob_check, False)
```

Conclusion

These changes should not be problematic from a numerical perspective. I've taken extra steps to ensure minimal outward behavior change (i.e., keeping the possibility that TimeIndependentMDCStore.lsvec returns a vector with negative components). On the off chance that we do see an uptick in numerical issues with GST model fitting, we have this detailed investigation to refer back to.

@sserita, @coreyostrove, please put me out of my misery.
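As promised above, here is a minimal NumPy sketch of the two dlsvec multiplication orderings. It is illustrative only: the array names mirror the expressions quoted above, but the shapes and values are made up and this is not pyGSTi code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_elements, n_params = 50, 12

# Stand-ins for the quantities in the expressions above (values are made up).
dprobs = rng.standard_normal((n_elements, n_params))
raw_dterms = rng.standard_normal(n_elements)
raw_lsvec = np.abs(rng.standard_normal(n_elements)) + 1e-3
pt5_over_raw_lsvec = 0.5 / raw_lsvec

# Old ordering: scale the per-element vector first, then broadcast against dprobs.
dlsvec_old = dprobs * (raw_dterms * pt5_over_raw_lsvec)[:, None]

# New ordering: broadcast raw_dterms against dprobs first, then scale.
dlsvec_new = (dprobs * raw_dterms[:, None]) * pt5_over_raw_lsvec[:, None]

# The two results agree up to floating point rounding (non-associativity):
# a few ulps of the entries, not exact bit equality in general.
print(np.max(np.abs(dlsvec_old - dlsvec_new)))
assert np.allclose(dlsvec_old, dlsvec_new, rtol=1e-12, atol=0.0)
```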
…me a file in test/performance so that it doesn't get discovered by testing frameworks like pytest
…t commit will simplify further, which may have consequences from the perspective of floating point arithmetic. Checking in NOW in case we want to revert to this version for numerical reasons.
Thanks for the huge effort @rileyjmurray, and the very detailed post explaining the differences. That helped a lot.
I tried to have maximal mercy but I couldn't stop myself from having two minor comment-related requests - add those in and it'll be a quick approve from me.
```diff
      allowed_perr=0.1, max_paths_per_outcome=1000,
      perr_heuristic='meanscaled', max_term_stages=5)
-     target_model.from_vector(1e-10 * np.ones(target_model.num_params))  # to seed term calc (starting with perfect zeros causes trouble)
+     target_model.from_vector(1e-9 * np.ones(target_model.num_params))  # to seed term calc (starting with perfect zeros causes trouble)
```
Why this change?
Uhhh .... if I didn't make it, then the test failed nastily. The optimizer just didn't converge.
I can compare side by side and see what the threshold is for the old code vs new code.
@sserita, I'm very glad you asked this question. It turns out that the non-convergence I observed stemmed from a bug in the LM optimizer. Specifically, it stemmed from how the LM optimizer decided to update the "best_x."
The implementations on develop (i.e., both the default SimplerLMOptimizer and the backup CustomLMOptimizer) would only update best_x if norm_f < min_norm_f and a boolean flag called new_x_is_known_in_bounds was True. But new_x_is_known_in_bounds is really more of a sufficient condition for being in bounds than a necessary condition. It makes more sense to do this:
If norm_f < min_norm_f, then check if new_x_is_known_in_bounds == True. If it is, then we take this flag's word for it and update best_x[:] = x. Otherwise, we explicitly check whether x is in bounds and we update best_x[:] = x if and only if that's the case. (See the sketch below.)
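Here is a minimal sketch of that accept/update step, under stated assumptions: the names norm_f, min_norm_f, new_x_is_known_in_bounds, and best_x follow the prose above, but the surrounding LM loop is elided and the in_bounds helper is hypothetical, not pyGSTi's actual API.

```python
import numpy as np

def in_bounds(x, lower, upper):
    """Hypothetical helper: explicit elementwise bounds check."""
    return bool(np.all(x >= lower) and np.all(x <= upper))

def maybe_update_best(x, norm_f, min_norm_f, new_x_is_known_in_bounds,
                      best_x, lower, upper):
    """One accept/update step, as described in the prose above.

    Returns the (possibly updated) running minimum of norm_f; best_x is
    modified in place when the candidate is accepted.
    """
    if norm_f < min_norm_f:
        # The flag is a *sufficient* condition for x being in bounds, so trust
        # it when it's True; otherwise fall back to an explicit check instead
        # of silently refusing to update best_x.
        if new_x_is_known_in_bounds or in_bounds(x, lower, upper):
            best_x[:] = x
            min_norm_f = norm_f
    return min_norm_f
```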
If I make that change then the test passes without making the change that prompted your comment.
See below for my investigation, written to a future pyGSTi developer who might come across this.
Investigation
Stefan's comment concerned a change I made in test/test_packages/drivers/test_calcmethods1Q.py::test_stdgst_prunedpath. The change was to replace a scalar that was hard-coded to 1e-10 with a scalar hard-coded to 1e-9. Let's call that scalar perturb. So the old value is perturb=1e-10.
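For context, the line under discussion has the shape shown below. This is a paraphrase with the hard-coded scalar factored out as perturb; the ToyModel class is a hypothetical stand-in so the snippet runs without pyGSTi.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for the pyGSTi model used in the test."""
    def __init__(self, num_params):
        self.num_params = num_params
        self._v = np.zeros(num_params)

    def from_vector(self, v):
        self._v = np.asarray(v, dtype=float).copy()

# 24 matches the "(approx) model params" count reported in the logs below.
target_model = ToyModel(num_params=24)

# The seeding line from the test, with the scalar factored out:
# perturb = 1e-10 on develop, 1e-9 in the commit Stefan commented on.
perturb = 1e-10
target_model.from_vector(perturb * np.ones(target_model.num_params))  # seed term calc
```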
I ran this test on develop with different values of perturb; those runs passed when perturb >= 5e-13 and failed when perturb <= 2.5e-13. Then I ran this test on the current branch using the commit at the time that Stefan left his comment. Those runs passed when perturb >= 2e-10 and failed when perturb <= 1.125e-10.
I started to write a report of these observations when I noticed something funny. In the test log it was clear that the LM optimizer was terminating outer iterations with a much larger value for norm_f than the minimum value seen across the outer iterations (specific logs are shown below). This made me suspect that best_x wasn't being set properly. Sure enough, I found this issue where a sufficient condition for an iterate being in-bounds (new_x_is_known_in_bounds) was effectively used as a necessary condition. After updating the termination criteria I was able to get this problematic unit test to pass when perturb==1e-10. In fact, it passed with perturb==1e-12. I didn't bother checking smaller values.
develop, passing with 5e-13
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=2.44949e-12, |J|=2835.48
--- Outer Iter 1: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=2835.51
--- Outer Iter 2: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3040.87
--- Outer Iter 3: norm_f = 95.8116, mu=168.158, |x|=0.276211, |J|=3017.42
--- Outer Iter 4: norm_f = 31.8847, mu=56.0528, |x|=0.268233, |J|=3017.41
--- Outer Iter 5: norm_f = 31.864, mu=18.6843, |x|=0.268133, |J|=3017.33
--- Outer Iter 6: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=3017.33
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332207 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166104, mu=979.996, |x|=0.00324549, |J|=2835.51
--- Outer Iter 1: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3040.87
--- Outer Iter 2: norm_f = 95.8116, mu=168.158, |x|=0.276211, |J|=3017.42
--- Outer Iter 3: norm_f = 31.8847, mu=56.0528, |x|=0.268233, |J|=3017.41
--- Outer Iter 4: norm_f = 31.864, mu=18.6843, |x|=0.268133, |J|=3017.33
--- Outer Iter 5: norm_f = 608.149, mu=326.665, |x|=0.255707, |J|=3017.33
--- Outer Iter 6: norm_f = 607.403, mu=1.52598e+07, |x|=0.255282, |J|=3040.98
--- Outer Iter 7: norm_f = 591.59, mu=5.08659e+06, |x|=0.255303, |J|=3040.85
--- Outer Iter 8: norm_f = 548.357, mu=1.69553e+06, |x|=0.255395, |J|=3040.46
--- Outer Iter 9: norm_f = 448.118, mu=565177, |x|=0.255827, |J|=3039.43
--- Outer Iter 10: norm_f = 289.143, mu=188392, |x|=0.257372, |J|=3037.08
--- Outer Iter 11: norm_f = 150.619, mu=62797.4, |x|=0.260537, |J|=3032.8
--- Outer Iter 12: norm_f = 68.2405, mu=20932.5, |x|=0.26454, |J|=3026.4
--- Outer Iter 13: norm_f = 36.1194, mu=6977.49, |x|=0.267382, |J|=3020.41
--- Outer Iter 14: norm_f = 31.9729, mu=2325.83, |x|=0.268035, |J|=3017.84
--- Outer Iter 15: norm_f = 31.8644, mu=775.277, |x|=0.268122, |J|=3017.37
--- Outer Iter 16: norm_f = 31.864, mu=258.426, |x|=0.268135, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.7s
Final optimization took 0.7s
develop, passing with 2e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=9.79796e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
--- Outer Iter 2: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 3: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 4: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 5: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 6: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 7: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 8: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 9: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 10: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=3017.33
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332181 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 2: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 3: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 4: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 5: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 6: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 7: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 8: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 9: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3017.33
--- Outer Iter 10: norm_f = 24207.2, mu=111502, |x|=0.0993976, |J|=3245.77
--- Outer Iter 11: norm_f = 1743.06, mu=37167.3, |x|=0.216206, |J|=3057.74
--- Outer Iter 12: norm_f = 177.863, mu=12389.1, |x|=0.26331, |J|=3032.07
--- Outer Iter 13: norm_f = 47.4228, mu=4129.7, |x|=0.269905, |J|=3020.44
--- Outer Iter 14: norm_f = 31.995, mu=1376.57, |x|=0.268271, |J|=3017.63
--- Outer Iter 15: norm_f = 31.8641, mu=458.855, |x|=0.268137, |J|=3017.34
--- Outer Iter 16: norm_f = 31.864, mu=137.657, |x|=0.268136, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
This PR, passing with 2e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=9.79796e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=2851.83
--- Outer Iter 2: norm_f = 26560.4, mu=334505, |x|=0.094016, |J|=3265.36
--- Outer Iter 3: norm_f = 3024.04, mu=117205, |x|=0.198562, |J|=3065.56
--- Outer Iter 4: norm_f = 223.51, mu=39068.3, |x|=0.250841, |J|=3037.55
--- Outer Iter 5: norm_f = 61.1352, mu=13022.8, |x|=0.264528, |J|=3026.48
--- Outer Iter 6: norm_f = 34.14, mu=4340.92, |x|=0.267594, |J|=3019.84
--- Outer Iter 7: norm_f = 31.9082, mu=1446.97, |x|=0.268025, |J|=3017.7
--- Outer Iter 8: norm_f = 31.8645, mu=482.324, |x|=0.26812, |J|=3017.36
--- Outer Iter 9: norm_f = 31.864, mu=160.775, |x|=0.268135, |J|=3017.33
--- Outer Iter 10: norm_f = 166090, mu=979.996, |x|=0.00324842, |J|=3017.33
--- Outer Iter 11: norm_f = 166090, mu=293.999, |x|=0.0032486, |J|=2851.52
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332180 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166090, mu=293.999, |x|=0.0032486, |J|=2851.52
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 696.596, mu=100352, |x|=0.273829, |J|=3033.45
--- Outer Iter 2: norm_f = 145.292, mu=33450.5, |x|=0.26568, |J|=3028.08
--- Outer Iter 3: norm_f = 50.8609, mu=11150.2, |x|=0.267615, |J|=3021.55
--- Outer Iter 4: norm_f = 33.0403, mu=3716.73, |x|=0.268405, |J|=3017.96
--- Outer Iter 5: norm_f = 31.8829, mu=1238.91, |x|=0.268202, |J|=3017.31
--- Outer Iter 6: norm_f = 31.8642, mu=412.97, |x|=0.268145, |J|=3017.31
--- Outer Iter 7: norm_f = 31.864, mu=123.891, |x|=0.268137, |J|=3017.33
--- Outer Iter 8: norm_f = 696.596, mu=100352, |x|=0.273829, |J|=3017.33
--- Outer Iter 9: norm_f = 696.594, mu=30105.5, |x|=0.273829, |J|=3033.45
--- Outer Iter 10: norm_f = 80.7071, mu=10035.2, |x|=0.269167, |J|=3023.24
--- Outer Iter 11: norm_f = 34.0435, mu=3345.05, |x|=0.268725, |J|=3018.22
--- Outer Iter 12: norm_f = 31.886, mu=1115.02, |x|=0.268213, |J|=3017.33
--- Outer Iter 13: norm_f = 31.8642, mu=371.673, |x|=0.268145, |J|=3017.32
--- Outer Iter 14: norm_f = 31.864, mu=111.502, |x|=0.268137, |J|=3017.33
Least squares message = Both actual and predicted relative reductions in the sum of squares are at most 1e-06
2*Delta(log(L)) = 63.7279 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0.0168613)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
This PR, failing with 1e-10
Last iteration:
--- dlogl GST ---
Initial Term-stage model has 0 failures and uses 10.0% of allowed paths (ok=True).
--- Outer Iter 0: norm_f = 166109, mu=1, |x|=4.89898e-10, |J|=2835.48
--- Outer Iter 1: norm_f = 166100, mu=979.996, |x|=0.00324622, |J|=2839.6
--- Outer Iter 2: norm_f = 52952, mu=334505, |x|=0.0493353, |J|=3716.22
--- Outer Iter 3: norm_f = 8281.27, mu=203584, |x|=0.159582, |J|=3101.88
--- Outer Iter 4: norm_f = 631.08, mu=67861.3, |x|=0.234876, |J|=3044.93
--- Outer Iter 5: norm_f = 97.1942, mu=22620.4, |x|=0.260353, |J|=3031.15
--- Outer Iter 6: norm_f = 40.7621, mu=7540.14, |x|=0.266602, |J|=3022.47
--- Outer Iter 7: norm_f = 32.2466, mu=2513.38, |x|=0.26786, |J|=3018.4
--- Outer Iter 8: norm_f = 31.8699, mu=837.794, |x|=0.268084, |J|=3017.45
--- Outer Iter 9: norm_f = 31.864, mu=279.265, |x|=0.26813, |J|=3017.34
--- Outer Iter 10: norm_f = 31.864, mu=83.7794, |x|=0.268136, |J|=3017.33
--- Outer Iter 11: norm_f = 166100, mu=979.996, |x|=0.00324622, |J|=3017.33
--- Outer Iter 12: norm_f = 166100, mu=293.999, |x|=0.00324625, |J|=2839.55
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Least squares message = Objective function out-of-bounds! STOP
2*Delta(log(L)) = 332201 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Pruned path-integral: kept 13212 paths (10.0%) w/magnitude 135.6 (target=134.4, #circuits=66, #failed=0)
(avg per circuit paths=200, magnitude=2.055, target=2.036)
After adapting paths, num failures = 0, 10.0% of allowed paths used.
--- Outer Iter 0: norm_f = 166100, mu=293.999, |x|=0.00324625, |J|=2839.55
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
Paths are insufficient! (Using meanscaled heuristic with error bound of 0.1)
--- Outer Iter 1: norm_f = 8859.56, mu=100352, |x|=0.157859, |J|=3102.33
--- Outer Iter 2: norm_f = 392.573, mu=33450.5, |x|=0.243907, |J|=3037.75
--- Outer Iter 3: norm_f = 56.3503, mu=11150.2, |x|=0.265337, |J|=3024.63
--- Outer Iter 4: norm_f = 33.251, mu=3716.73, |x|=0.268011, |J|=3018.92
--- Outer Iter 5: norm_f = 31.8752, mu=1238.91, |x|=0.268108, |J|=3017.49
--- Outer Iter 6: norm_f = 31.864, mu=412.97, |x|=0.268133, |J|=3017.34
--- Outer Iter 7: norm_f = 8859.56, mu=100352, |x|=0.157859, |J|=3017.34
Least squares message = Relative change, |dx|/|x|, is at most 1e-08
2*Delta(log(L)) = 17719.1 (66 data params - 24 (approx) model params = expected mean of 42; p-value = 0)
Term-states Converged!
Completed in 0.6s
Final optimization took 0.6s
-- Performing 'go0' gauge optimization on GateSetTomography estimate --
MISFIT nSigma = 1928.7314162397222
Fantastic work tracking this down, @rileyjmurray!
Thanks, @coreyostrove! I feel pretty good about this one. Who knows how many past calls to LM suffered from this bug. It's wild to think about, since LM was already performing very well, and now it'll perform even better :)
Agreed, nice work and as always, the detailed writeups are incredibly appreciated!
…rease 1e-10 to 2e-10 instead of increasing 1e-10 to 1e-9).
@sserita once tests pass we should be good to merge :)
Thanks as always for the great work, and tracking down that bug in the LM code!
While playing around with MPI parallelization across parameters (instead of circuits, which is the default for the map simulator as of 0.9.13) I ran into a bug that I tracked down to a shared-memory issue inadvertently introduced with PR #515. Attached is a script which, if run using MPI (I was using 4 ranks), runs a single iteration of two-qubit GST on a single thread and on all four threads and logs the results for comparison: [debug_dist_objective_fn.py](https://github.com/user-attachments/files/22715098/debug_dist_objective_fn.py). Running this script on 0.9.14.1 yields the following output:

```
Number of Ranks is 4

Fit Log Single Rank
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (1,)
MapLayout: 1 processors divided into 1 x 1 (= 1) grid along circuit and parameter directions.
1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (1 cores) ***
More atom-processors than hosts: each host gets ~1 atom-processors
Atom-processors already occupy a single node, dividing atom-processor into 1 param-processors.
*** Divided 1-host atom-processor (~1 procs) into 1 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
--- chi2 GST ---
--- Outer Iter 0: norm_f = 4.5053e+09, mu=1, |x|=0, |J|=3.86694e+08
  - Inner Loop: mu=1.542e+12, norm_dx=4.69768e-06
    (cont): norm_new_f=1289.29, dL=4.42172e+09, dF=4.5053e+09, reldL=0.981449, reldF=1
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=1.0189  mu * 0.333333 => 5.14e+11
Least squares message = Maximum iterations (1) exceeded
_objfn = 1289.29 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 1.4s
Iteration 1 took 1.5s

Last iteration:
--- logl GST ---
--- Outer Iter 0: norm_f = 644.825, mu=1, |x|=0.00216741, |J|=556063
  - Inner Loop: mu=3.08599e+06, norm_dx=8.62909e-06
    (cont): norm_new_f=549.175, dL=95.681, dF=95.65, reldL=0.148383, reldF=0.148335
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=0.999676  mu * 0.333333 => 1.02866e+06
Least squares message = Maximum iterations (1) exceeded
_objfn = 1098.35 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 1.3s
Final optimization took 1.3s

Fit Log Four Ranks
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (4,)
MapLayout: 4 processors divided into 1 x 4 (= 4) grid along circuit and parameter directions.
1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (4 cores) ***
*** Using shared memory to communicate between the 4 cores on each of 1 hosts ***
More atom-processors than hosts: each host gets ~1 atom-processors
Atom-processors already occupy a single node, dividing atom-processor into 4 param-processors.
*** Divided 1-host atom-processor (~4 procs) into 4 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
--- chi2 GST ---
--- Outer Iter 0: norm_f = 6381.58, mu=1, |x|=0, |J|=3.66935e+11
  - Inner Loop: mu=1.44141e+18, norm_dx=4.27565e-18
    (cont): norm_new_f=6177.24, dL=4871.93, dF=204.333, reldL=0.763437, reldF=0.0320192
    Accepted! gain ratio=0.0419408  mu * 0.3 => 4.32423e+17
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
Least squares message = Maximum iterations (1) exceeded
_objfn = 6177.24 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 0.5s
Iteration 1 took 0.6s

Last iteration:
--- logl GST ---
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
--- Outer Iter 0: norm_f = 2617.65, mu=1, |x|=2.06777e-09, |J|=6.23454e+11
  - Inner Loop: mu=4.05554e+18, norm_dx=3.83592e-19
    (cont): norm_new_f=2605.54, dL=1399.13, dF=12.1112, reldL=0.534499, reldF=0.00462674
    WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Accepted! gain ratio=0.00865623  mu * 0.3 => 1.21666e+18
Least squares message = Maximum iterations (1) exceeded
_objfn = 5211.07 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
Completed in 0.5s
Final optimization took 0.5s
```

Obviously the final objective function values are quite different, but what caught my eye is that this was true even at the very initial step, before the optimizer had moved at all:

Single threaded: norm_f = 4.5053e+09
Multi-threaded: norm_f = 6381.58

Shared memory is a real mind-bender, so I'll spare folks the details of how I tracked this down, but ultimately I traced it to [this line](https://github.com/sandialabs/pyGSTi/blob/339c8a3df6e84c9a154e04d42e090ab9d27f78b2/pygsti/objectivefns/objectivefns.py#L4415). The bug arises because in the multi-threaded setting `lsvec` is an instance of `LocalNumpyArray`, which generally points to an array that lives in shared memory. The square root on that line was not guarded, and since each rank was pointing to the same memory, the square root could be applied as many as `num_ranks` times (in my testing this was actually nondeterministic, usually 3 or 4 times, probably due to a race condition). I pattern matched off of other parts of `objectivefns.py` where shared memory is used (mainly `terms` in `TimeIndependentMDCObjectiveFunction`) and think I got the correct guards added in (the patch seemed to work in my testing). I also added back in a few comments related to the use of shared memory in this part of the code that were removed in #515, which I found useful in debugging this.

P.S. As for why this wasn't caught sooner: the default forward simulator since 0.9.13 is `MapForwardSimulator`, and the default parallelization strategy for map is across circuits. It happens that when parallelizing only across circuits the portions of the shared memory indexed into by each rank are non-overlapping, but that isn't true when parallelizing across parameters. So you'd have had to manually change the parallelization strategy to see this behavior, which most users don't do unless they have a specific reason to.

Co-authored-by: Corey Ostrove <ciostro@sandia.gov>
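For readers unfamiliar with this failure mode, here is a minimal mpi4py sketch of the general pattern: an unguarded in-place update of a node-shared array gets applied once per rank. It is not pyGSTi's actual code; LocalNumpyArray and the real guard logic live in pyGSTi, and the window setup and rank-0 guard below are illustrative assumptions.

```python
# Run with e.g.:  mpiexec -n 4 python shared_sqrt_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 8

# One shared buffer per node; every rank maps the *same* memory
# (an illustrative stand-in for an array backing a LocalNumpyArray).
itemsize = MPI.DOUBLE.Get_size()
win = MPI.Win.Allocate_shared(n * itemsize if rank == 0 else 0, itemsize, comm=comm)
buf, _ = win.Shared_query(0)
shared = np.ndarray(buffer=buf, dtype='d', shape=(n,))

if rank == 0:
    shared[:] = 4.0          # pretend these are the squared residuals ("terms")
comm.Barrier()

# BUG pattern (what an unguarded in-place sqrt does): every rank applies the
# square root to the same memory, so it can run up to comm.size times and the
# result depends on timing:
#   shared[:] = np.sqrt(shared)

# One possible guard: let a single rank perform the in-place update.
if rank == 0:
    shared[:] = np.sqrt(shared)
comm.Barrier()

if rank == 0:
    print(shared)            # [2. 2. 2. 2. 2. 2. 2. 2.]

win.Free()
```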
Note: For some reason the automatic merge from bugfix into develop failed on PR #660, so this PR is just manually performing that merge.
TimeIndependentMDCObjectiveFunction is one of the most complicated classes I've encountered in pyGSTi. Right now ...
This PR simplifies TimeIndependentMDCObjectiveFunction so that
These changes should materially improve the maintainability of TimeIndependentMDCObjectiveFunction, as well as that of its 8 subclasses, which collectively span ~500 lines. The outward behavior of this class should be the same as before, up to floating point rounding errors.
For the curious, I went down this rabbit hole when working on PR #506.
Incidental changes related to MPI testing: moved some lightweight MPI tests from
test/test_packages to test/unit. This move was prompted by the fact that this PR touches code that involves a lot of MPI. While I was at it, I renamed an MPI performance profiling file so that it didn't get misinterpreted as a test file.