-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart files are different for CLM when run over different number of tasks #81
Comments
Erik Kluzek < erik > - 2011-03-31 10:24:42 -0600 Also indices seem to change... Another bug in the CLM restart files came up in my discussion with Mariana, and she asked that I pass this along to you. It seems that the indices, like pfts1d_li, etc., in the clm restart files depend on the decomposition, and are possibly incorrect. |
Erik Kluzek < erik > - 2011-04-01 10:44:04 -0600 I got a confirmation of this problem for a test I ran on the crop branch where I see the following differences... 034 erL83 TER.sh _nrsc_do clm_std^nl_urb 20020115:3600 5x5_amazon navy -5+-5 arb_ic .............FAIL! rc= 13 Everything's bit-for-bit up to... CLM_compare.sh: comparing clmrun.clm2.h1.2002-01-20-00000.nc RMS land1d_g 7 and the same pattern to the last h1 file on 2002-01-25. |
Erik Kluzek < erik > - 2011-04-01 10:50:22 -0600 In talking about this with Mariana, she thought we should probably comment out She also wondered if the albedo difference would be a problem if you used a She did agree that we don't want to compare restart files in practice -- we |
Erik Kluzek < erik > - 2011-04-01 11:17:02 -0600 I'm seeing the same problem for the following tests on the crop branch on bluefire... 002 erA91 TER.sh _sc_dh clm_std^nl_urb 20030101:3600 4x5 gx3v7 -3+-3 arb_ic .....................FAIL! rc= 13 the differences are all in the indices again. This must be something that shows up in a very recent version of clm. This problem would've showed up previously otherwise. The documentation in the ChangeLog shows these tests passing even back to clm4_0_22. I'm not sure what to think here. I'll rerun one of these for clm4_0_26 and clm4_0_25 and see what I get... |
Erik Kluzek < erik > - 2011-04-01 11:40:54 -0600 OK, so the erL83 test PASSes on clm4_0_24 and clm4_0_25 -- but fails on clm4_0_26. |
Erik Kluzek < erik > - 2011-04-01 12:51:12 -0600 The code differences between 25 and 26 were trivial, and didn't explain the problem. It turns out what does -- is the version of cprnc used. In 26 the newer version is used and apparently it compares the indices -- but the earlier one didn't. When I run the same test for older version of clm (clm4_0_25 and even clm4_0_14) -- I get fails on the tests when using the newer cprnc (updating CLM_compare.sh and test_driver.sh). So this is a long-standing problem that just shows up due to the newer version of cprnc. When I was figuring out the issues with the newer cprnc in the clm4_0_26 tag, I must not have rerun the tests that passed before. |
Erik Kluzek < erik > - 2011-04-01 13:55:27 -0600 OK, so I talked about this with Keith and Sam. First of all in principle the indices are important and 1D vector mode is important. However, Mariana worked with Sam and showed him indices that were wrong and verified he wasn't using them. interpinic also doesn't use them as they are incorrect. You can derive them from other indices that do exist. I also verified that with the new cprnc, I get my test to PASS when I don't write out the fields: land1d_gi, cols1d_li, cols1d_gi, pfts1d_gi, pfts1d_li, and pfts1d_ci in histFileMod.F90 and subgridRestMod.F90. So for now that's what I'll do. But, at some point we may want to fix this in the future. I'll continue to do work on it. |
Erik Kluzek < erik > - 2011-05-02 14:35:42 -0600 In clm4_0_27 -- I comment out writing out the fields that are wrong. But, we should figure out why they are wrong. |
Bill Sacks < sacks > - 2013-08-31 06:35:23 -0600 I noticed that
is still in the restart file; should this be removed like the other indices were? |
Bill Sacks < sacks > - 2013-08-31 06:44:25 -0600 Regarding this: Also indices seem to change... (land1d_g, cols1d_g, etc.): I was thinking about this issue with respect to the 1-d vector reordering that was done recently in the clm4.5 code (in clm4_5_21, I believe). I believe (but may be wrong) that the indices in memory do not correspond to indices on the file. I'm not sure this explains why these depended on the decomposition before, but I think they will almost certainly depend on the decomposition now. The issue, I think, is that the arrangement in memory (during a run) does depend on the decomposition - both on the number of processors and on the number of threads (or more precisely, the number of clumps per processor). Mariana went to some pains to make the arrangement of points on the restart file independent of decomposition. However, I think that land1d_g, cols1d_g, etc. refer to the indices in MEMORY, not the indices on the file. If these variables are desired on the restart file - and 1-d history files (with dov2xy = .false.), a possible workaround would be to also write variables like the following: land1d_l, cols1d_c, etc. These would give the index of each point in memory. Note that we do NOT expect cols1d_c(1) to be 1 - again, because the arrangement on the file differs from the arrangement in memory. Furthermore, these variables - like the other indices - would differ depending on the decomposition. But at least it would let you do analyses using these variables. |
Erik Kluzek < erik > - 2013-08-31 11:08:02 -0600 Something we gave up with the move from CLM stand-alone testing to CESM testing was the change of number of processors on restart. We assume that the model is able to do this and startup for any number of processors, regardless of how many processors the restart file was written with. I think this issue underscores the importance of this issue and of adding testing for that ability into CESM testing (at least for components like CLM where this is expected). It would be good to hear if the current model gives the same answers if you restart with a different number of processors. |
Bill Sacks < sacks > - 2013-08-31 15:37:42 -0600 Regarding Erik's last comment: I think we should be okay if we have a PEM (and once threading is fixed, a PET) test that starts up from initial conditions. I haven't checked if the current tests do so. |
Erik Kluzek < erik > - 2017-03-21 10:27:00 -0600 Sean Swenson has this to say... hi, erik,
can be fixed using the code from subgridRestMod: do p=bounds%begp,bounds%endp
iparr(p) = GetGlobalIndex(decomp_index=patch%column(p), clmlevel=namec)
enddo
call restartvar(ncid=ncid, flag=flag, varname='pfts1d_column_index', xtype=ncd_int, &
dim1name='pft', &
long_name='column index of corresponding pft', &
interpinic_flag='skip', readvar=readvar, data=iparr) it seems to work for me offline when I use the pfts1d_column_index from the restart file instead of the pfts1d_ci from the history file. I haven't actually implemented code in histfileMod. |
@swensosc showed me his fix to this problem in the hillslope_hydrology branch. The bit of code in histFileMod is like this do p=bounds%begp,bounds%endp
iparr(p) = GetGlobalIndex(decomp_index=patch%column(p), clmlevel=namec)
enddo
call ncd_io(varname='pfts1d_ci' , data=iparr , dim1name=namep, ncid=ncid, flag='write') As far as he can tell it's doing the right thing. There's a possibility he hasn't done enough checking, but it's not demonstrably wrong like we saw before. There's a different transformation that as attempted before that apparently is incorrect. We'd like to fix this right away as not having this on output is causing problems, you have to assume order and assume the correct one. |
New options for irrigation and crop fsat Introduce three new options: (1) Ability to withdraw irrigation water from groundwater if not enough water is available from rivers. This is controlled via new namelist flag, use_groundwater_irrigation. Water can be withdrawn from both the unconfined (from the soil column) and confined (from wa) aquifers. (2) Irrigation method: sprinkler (above canopy) vs. drip (below canopy). This can be set on a per-crop and per-gridcell basis on the surface dataset, but out-of-the-box support for creating the necessary surface dataset field is not yet in place (see #565). For now, it can be controlled globally via a new namelist flag, irrig_method_default. The default is drip, which was what we were previously using implicitly. (3) Set crop fsat to zero. This is controlled by a new namelist option, crop_fsat_equals_zero. Default behavior is the same as before for all three options. Also: - If use_aquifer_layer is false (which is the default for CLM50), no longer reset wa_col every time step - Adds indices to vector history files giving column, landunit and gridcell indices for each patch, etc. (Resolves #81: "Restart files are different for CLM when run over different number of tasks" (issue name is a misnomer of the remaining to-dos in that issue.) - Writes 3d time-constant fields on first history file for all tapes, not just the first - Small performance improvements to irrigation
@ekluzek I see you put milestone cesm2.1.1 on this. Note that it was fixed in ctsm1.0.dev020, but that tag won't come to the release branch, so if you want it on the release branch, the relevant changes in histFileMod will need to be ported over. |
New options for irrigation and crop fsat Introduce three new options: (1) Ability to withdraw irrigation water from groundwater if not enough water is available from rivers. This is controlled via new namelist flag, use_groundwater_irrigation. Water can be withdrawn from both the unconfined (from the soil column) and confined (from wa) aquifers. (2) Irrigation method: sprinkler (above canopy) vs. drip (below canopy). This can be set on a per-crop and per-gridcell basis on the surface dataset, but out-of-the-box support for creating the necessary surface dataset field is not yet in place (see ESCOMP#565). For now, it can be controlled globally via a new namelist flag, irrig_method_default. The default is drip, which was what we were previously using implicitly. (3) Set crop fsat to zero. This is controlled by a new namelist option, crop_fsat_equals_zero. Default behavior is the same as before for all three options. Also: - If use_aquifer_layer is false (which is the default for CLM50), no longer reset wa_col every time step - Adds indices to vector history files giving column, landunit and gridcell indices for each patch, etc. (Resolves ESCOMP#81: "Restart files are different for CLM when run over different number of tasks" (issue name is a misnomer of the remaining to-dos in that issue.) - Writes 3d time-constant fields on first history file for all tapes, not just the first - Small performance improvements to irrigation
New options for irrigation and crop fsat Introduce three new options: (1) Ability to withdraw irrigation water from groundwater if not enough water is available from rivers. This is controlled via new namelist flag, use_groundwater_irrigation. Water can be withdrawn from both the unconfined (from the soil column) and confined (from wa) aquifers. (2) Irrigation method: sprinkler (above canopy) vs. drip (below canopy). This can be set on a per-crop and per-gridcell basis on the surface dataset, but out-of-the-box support for creating the necessary surface dataset field is not yet in place (see ESCOMP#565). For now, it can be controlled globally via a new namelist flag, irrig_method_default. The default is drip, which was what we were previously using implicitly. (3) Set crop fsat to zero. This is controlled by a new namelist option, crop_fsat_equals_zero. Default behavior is the same as before for all three options. Also: - If use_aquifer_layer is false (which is the default for CLM50), no longer reset wa_col every time step - Adds indices to vector history files giving column, landunit and gridcell indices for each patch, etc. (Resolves ESCOMP#81: "Restart files are different for CLM when run over different number of tasks" (issue name is a misnomer of the remaining to-dos in that issue.) - Writes 3d time-constant fields on first history file for all tapes, not just the first - Small performance improvements to irrigation
Set the default DATM_YR_* variables for 1PT streams to 2018 to 2019. For the NEON streams use the $DATM_YR_* settings rather than hardcoded values. This allows the case settings of the DATM_YR_* variables change the datm.streams.xml file. ### Description of changes ### Specific notes Contributors other than yourself, if any: CMEPS Issues Fixed (include github issue #): Fixes ESCOMP#78 Are there dependencies on other component PRs: Yes CTSM ESCOMP#1278 Are changes expected to change answers? No - [x] bit for bit Any User Interface Changes (namelist or namelist defaults changes)? No - [x ] No Testing performed: Ran SMS_D_Vnuopc.CLM_USRDAT.I1PtClm51Bgc.cheyenne_intel.clm-NEON_NIWO which worked and PASSes. Hashes used for testing: - [ ] CIME. cime5.6.43 - [ ] CMEPS v0.10.0 - [ ] CTSM: ctsm5.1.dev036-31-geb5f2e309 (NEON-compset PR ESCOMP#1278)
Erik Kluzek < erik > - 2011-03-31 10:24:06 -0600
Bugzilla Id: 1310
Bugzilla CC: dlawren, jedwards, jshollen, mvertens, oleson, rfisher, sacks,
This is from Bill Sacks...
With Mariana's help, I believe I have uncovered a minor bug in the urban model; this came up while testing the new CLM multi-instance code that I have been working on. This appears as a difference in the lnd restart files depending on the number of processors. It's possible that this is unimportant, but I thought I'd let you know anyway.
In particular, the two variables albgrd and albgri differ in some urban landunits in the CLM restart files. I have confirmed this with the latest clm tag (clm4_0_26), doing a 5-day run with resolution f19_g16, and comparing results using 64 vs. 16 tasks for the land model. You can see the output of cprnc in /ptmp/sacks/archive/clm4_0_26.init.quarterPEs/rest/0001-01-06-00000/cprnc.out
I believe that what is going on is the following:
(1) In UrbanMod.F90: UrbanAlbedo: A count is made of the number of urban landunits with coszen > 0 (num_solar); note that this count is just of the number of landunits that this processor is responsible for; thus, this is where the # PE-dependence comes in, I think.
(2) Later in that subroutine, a bunch of calculations are done if num_solar > 0 -- i.e., if this PE is responsible for at least one urban landunit with coszen > 0. Note that many of these calculations are done for all landunits, even ones for which coszen = 0. This introduces the possibility for different results depending on the decomposition.
(3) The particular difference that I am seeing is in albgrd & albgri. These are initialized to 0 at the start of the subroutine, and so remain 0 on any PE for which num_solar = 0. However, for PEs with num_solar > 0, landunits that have coszen = 0 end up getting albgrd = albgri = 1. This is because the calculation of albgrd & albgri depends on the values of the sref_* variables, which are initialized to 1 (and stay at 1 for any landunit for which coszen = 0).
The text was updated successfully, but these errors were encountered: