Fix bug in determining when decompositions can be reused by the SMIOL library #1288
… library

The MPAS_io_set_var_indices() routine in mpas_io.F contains logic to determine whether a decomposition can be reused from the set of existing decompositions; this can save time, as the creation of a new decomposition structure can incur a non-trivial cost.

The checks to determine whether a SMIOL decomposition can be reused previously employed the following criteria:

* The number of compute offsets (indices) on an MPI rank matches
* The offsets themselves for the MPI rank all agreed

These two criteria alone left open the possibility that all MPI ranks may individually find apparently compatible decompositions, but that the decomposition selected by each MPI rank may be different, leading to I/O errors.

Consider the following two decompositions with sets of compute offsets on two MPI ranks:

                 MPI rank 0    MPI rank 1
    Decomp 1:      {0,2}          {1}
    Decomp 2:       {0}           {1}

If a new field in which rank 0 has offset {0} and rank 1 has offset {1} was considered, MPI rank 0 would select decomp 2 as a matching decomposition, while MPI rank 1 would select decomp 1 (because decompositions are tested in order).

Now, a third criterion has been added to the checks, namely, that the decomposition must have the same global number of offsets. It is worth noting that this new criterion is a necessary condition for correctly reusing decompositions, but combined with the existing criteria, may not be sufficient to guarantee correct reuse of decompositions under all possible conditions.
@abishekg7 @gdicker1 One way to induce a failure in the current release code is to run on a mesh with 12 cells using 12 MPI tasks. In |
Looks good to me.
Tested the failing case (using the 12-cell grid) with the current master, and it seems to fail with this message: `Message from rank 0 and tag 5 truncated; 4 bytes received but buffer size is 8`. The run with this PR, by contrast, finishes successfully. The output netCDF is bit-identical between the serial and 12-rank runs.
This merge addresses several issues in the MPAS-Atmosphere model and in the MPAS infrastructure. Specific changes include:

* Correction of the pool from which lbc_scalar constituent indices are obtained in the init_atm_thompson_aerosols_lbc routine. Rather than obtaining index_nifa and index_nwfa from the state pool, the indices of lbc_nifa and lbc_nwfa should be obtained from the lbc_state pool. (PR #1249)

* Correction to the computation of the soil temperature (TSLB) in the Noah-MP land surface scheme through the addition of initialization of the soil liquid water (SH2O) in the noahmp_init subroutine in module mpas_atmphys_lsm_noahmpinit.F prior to calling NoahmpInitMain. (PR #1244)

* Correction of the units of the fields 'greenfrac', 'shdmin', 'shdmax', 'vegfra', and 'albedo12m' from "unitless" to "percent" in the init_atmosphere and atmosphere core Registry.xml files. Also, a correction to the spelling of 'greenness' in several places. (PR #1248)

* Removal of a duplicate allocation of indexToEdgeID % array in the mpas_io_setup_edge_block_fields routine that was the source of a memory leak. (PR #1258)

* Fix for a memory leak in mpas_block_creator_build_cell_halos by deallocating the cellLimitField field before the routine returns. (PR #1264)

* Fix for a bug in the logic for determining when decompositions can be reused by the SMIOL library. In almost any practical situation, however, this bug created no issues. (PR #1288)

* Changes in the init_atmosphere core to provide more reliable error messages in case config_nfglevels is not set to a value that is at least as large as the number of vertical levels in the first-guess intermediate file. (PR #1291)

* Correction of the loop for Noah-MP snow initialization, capping snow water equivalent maximum at 2000 mm. (PR #1300)

* Fix for a bug in the horizontal 2nd-order filter for the CAM upper absorbing layer, where the wrong level in the kdiff field was being used when enforcing a lower-bound on kdiff. This absorbing layer is active only when config_mpas_cam_coef > 0.0. (PR #1302)

* Fix in the mountain wave idealized test case initialization when multiple MPI tasks are used. The 'xc' variable, which represents the center-point location of the mountain, was previously computed based on the maximum xCell values local to an MPI task, leading to inconsistent values on each MPI rank. By finding the maximum of xCell over all MPI ranks and ensuring that all MPI ranks use this global maximum, the terrain field is computed consistently between serial and parallel runs of the init_atmosphere_model program for the mountain wave test case (config_init_case = 6). (PR #1312)

* Correction to the calculation of the 2-meter diagnostics (T2M, TH2M, and Q2) when using the Noah-MP land surface scheme. While the computation of 2-meter diagnostics is the same for Noah and Noah-MP over oceans, it is different between the two land surface schemes over land. In Noah-MP, the 2-meter diagnostics are weighted as functions of their respective diagnostics over bare soil and over vegetation. The updated diagnostics for Noah and Noah-MP are now computed in the new file mpas_atmphys_sfc_diagnostics.F. (PR #1242)

* Fix to provide consistency in the ringing behavior of recurring alarms after their reference time has been adjusted with a call to mpas_adjust_alarm_to_reference_time. Now, adjusting the reference time for an alarm will always leave that alarm in a state such that it is considered by the mpas_is_alarm_ringing routine to be ringing at the current time. With this fix, limited-area simulations can be restarted at times between LBC updates, provided the reference_time attribute for the 'lbc_in' stream is set to the simulation initial time in the streams.atmosphere file. (PR #1290)

* Correction of an indexing error for rvcuten in code blocks specific to the Grell-Freitas scheme in the convection driver. Specifically, in the convection_from_MPAS and convection_to_MPAS routines, rvcuten used (k,k) as indexing in a loop, where (k,i) is needed. Since the Grell-Freitas scheme does not provide momentum tendencies, the changes in this merge have no impact on results. (PR #1283)
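The mountain wave item (PR #1312) comes down to the difference between a per-rank maximum and a maximum over all ranks. The following is a minimal sketch of that difference only, simulated in plain Python without MPI. The xCell values and the `x_cell_by_rank` layout are made up for illustration, and how 'xc' is derived from the maximum is not stated here, so that step is left out.

```python
# Hypothetical illustration: two simulated MPI ranks, each owning some cells.
# The xCell values below are invented; they are not taken from any MPAS mesh.
x_cell_by_rank = {
    0: [0.0, 1000.0, 2000.0],
    1: [3000.0, 4000.0, 5000.0],
}

# Previous behavior as described: each rank only sees the maximum of its own cells,
# so different ranks end up with different values.
local_max = {rank: max(xs) for rank, xs in x_cell_by_rank.items()}

# Fixed behavior as described: every rank uses the maximum over all ranks.
# In a real MPI program this would be a max reduction across ranks
# (for example MPI_Allreduce with MPI_MAX); here it is simulated directly.
global_max = max(local_max.values())

print("local maxima per rank:", local_max)
print("global maximum used by all ranks:", global_max)
```

With the local maxima, any quantity derived from them differs between ranks and between serial and parallel runs. With the global maximum, every rank starts from the same value.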
This PR fixes a bug in the logic for determining when decompositions can be reused by the SMIOL library.

The `MPAS_io_set_var_indices()` routine in `mpas_io.F` contains logic to determine whether a decomposition can be reused from the set of existing decompositions; this can save time, as the creation of a new decomposition structure can incur a non-trivial cost.

The checks to determine whether a SMIOL decomposition can be reused previously employed the following criteria:

* The number of compute offsets (indices) on an MPI rank matches
* The offsets themselves for the MPI rank all agreed

These two criteria alone left open the possibility that all MPI ranks may individually find apparently compatible decompositions, but that the decomposition selected by each MPI rank may be different, leading to I/O errors.
Consider the following two decompositions with sets of compute offsets on two MPI ranks:

                 MPI rank 0    MPI rank 1
    Decomp 1:      {0,2}          {1}
    Decomp 2:       {0}           {1}

If a new field in which rank 0 has offset {0} and rank 1 has offset {1} was considered, MPI rank 0 would select decomp 2 as a matching decomposition, while MPI rank 1 would select decomp 1 (because decompositions are tested in order).
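The mismatch can be reproduced with a small simulation. This is a simplified, hypothetical model of the two original criteria, not the code in `mpas_io.F`: the dictionary layout and the `select_decomp_local` helper are assumptions made for illustration. Decomp 1 has offsets {0,2} on rank 0 and {1} on rank 1, and decomp 2 has {0} on rank 0 and {1} on rank 1.

```python
# Hypothetical, simplified model of the original reuse check; not the MPAS/SMIOL code.
# Each decomposition maps an MPI rank to the list of compute offsets on that rank.
DECOMPS = [
    {0: [0, 2], 1: [1]},  # Decomp 1
    {0: [0], 1: [1]},     # Decomp 2
]


def select_decomp_local(decomps, rank, offsets):
    """Return the 1-based number of the first decomposition whose offsets on
    this rank match `offsets` in both count and value, or None if none match."""
    for number, decomp in enumerate(decomps, start=1):
        existing = decomp[rank]
        if len(existing) == len(offsets) and existing == offsets:
            return number
    return None


# New field: rank 0 has offset {0}, rank 1 has offset {1}.
new_field = {0: [0], 1: [1]}

# Each rank decides on its own, testing decompositions in order.
choices = {
    rank: select_decomp_local(DECOMPS, rank, offsets)
    for rank, offsets in new_field.items()
}
print(choices)  # {0: 2, 1: 1}
```

Rank 0 skips decomp 1 because the offset counts differ, while rank 1 accepts decomp 1 because its own single offset matches, so the two ranks end up with different decompositions.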
Now, a third criterion has been added to the checks, namely, that the decomposition must have the same global number of offsets. It is worth noting that this new criterion is a necessary condition for correctly reusing decompositions, but combined with the existing criteria, may not be sufficient to guarantee correct reuse of decompositions under all possible conditions. Nonetheless, the addition of the third criterion enables the correct selection of decompositions under conditions that previously resulted in failure.
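A sketch of the three-criterion check, again as a simplified model rather than the actual implementation. The `select_decomp` and `choices` helpers are hypothetical, as is the way the global count is computed here (a plain sum over simulated ranks). The first case is the two-rank example from this PR. The second case is a contrived three-rank arrangement, not taken from any MPAS run, in which the three criteria still allow ranks to disagree within this model.

```python
# Hypothetical, simplified model of the updated reuse check; not the MPAS/SMIOL code.
def select_decomp(decomps, rank, offsets, global_count):
    """Return the 1-based number of the first decomposition that matches the
    rank-local offsets (count and values) and the global number of offsets."""
    for number, decomp in enumerate(decomps, start=1):
        existing = decomp[rank]
        same_local = len(existing) == len(offsets) and existing == offsets
        same_global = sum(len(o) for o in decomp.values()) == global_count
        if same_local and same_global:
            return number
    return None


def choices(decomps, field):
    """Selection made independently by every rank for the given field."""
    global_count = sum(len(o) for o in field.values())
    return {
        rank: select_decomp(decomps, rank, offsets, global_count)
        for rank, offsets in field.items()
    }


# Case 1: the example from this PR. Decomp 1 has 3 global offsets, the field has 2,
# so rank 1 no longer accepts decomp 1.
pr_decomps = [{0: [0, 2], 1: [1]}, {0: [0], 1: [1]}]
pr_field = {0: [0], 1: [1]}
print(choices(pr_decomps, pr_field))  # {0: 2, 1: 2}

# Case 2 (contrived): both decompositions have 3 global offsets, and ranks 0 and 2
# match decomp 1 locally, while rank 1 only matches decomp 2.
hyp_decomps = [{0: [0], 1: [3], 2: [2]}, {0: [0], 1: [1], 2: [2]}]
hyp_field = {0: [0], 1: [1], 2: [2]}
print(choices(hyp_decomps, hyp_field))  # {0: 1, 1: 2, 2: 1}
```

In this model the global count is enough to resolve the PR's example, which is consistent with the note that the new criterion is necessary but possibly not sufficient. Whether arrangements like the second case arise in practice is not addressed here.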