
Restart files are different for CLM when run over different number of tasks #81

Closed
ekluzek opened this issue Dec 16, 2017 · 15 comments

Labels: enhancement (new capability or improved behavior of existing capability); priority: high (High priority to fix/merge soon, e.g., because it is a problem in important configurations)

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-03-31 10:24:06 -0600
Bugzilla Id: 1310
Bugzilla CC: dlawren, jedwards, jshollen, mvertens, oleson, rfisher, sacks,

This is from Bill Sacks...

With Mariana's help, I believe I have uncovered a minor bug in the urban model; this came up while testing the new CLM multi-instance code that I have been working on. This appears as a difference in the lnd restart files depending on the number of processors. It's possible that this is unimportant, but I thought I'd let you know anyway.

In particular, the two variables albgrd and albgri differ in some urban landunits in the CLM restart files. I have confirmed this with the latest clm tag (clm4_0_26), doing a 5-day run with resolution f19_g16, and comparing results using 64 vs. 16 tasks for the land model. You can see the output of cprnc in /ptmp/sacks/archive/clm4_0_26.init.quarterPEs/rest/0001-01-06-00000/cprnc.out

I believe that what is going on is the following:

(1) In UrbanMod.F90: UrbanAlbedo: A count is made of the number of urban landunits with coszen > 0 (num_solar); note that this count is just of the number of landunits that this processor is responsible for; thus, this is where the # PE-dependence comes in, I think.

(2) Later in that subroutine, a bunch of calculations are done if num_solar > 0 -- i.e., if this PE is responsible for at least one urban landunit with coszen > 0. Note that many of these calculations are done for all landunits, even ones for which coszen = 0. This introduces the possibility for different results depending on the decomposition.

(3) The particular difference that I am seeing is in albgrd & albgri. These are initialized to 0 at the start of the subroutine, and so remain 0 on any PE for which num_solar = 0. However, for PEs with num_solar > 0, landunits that have coszen = 0 end up getting albgrd = albgri = 1. This is because the calculation of albgrd & albgri depends on the values of the sref_* variables, which are initialized to 1 (and stay at 1 for any landunit for which coszen = 0).
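The three steps above can be sketched in miniature. The following is a hypothetical Python toy, not the CLM code; the names `num_solar`, `albgrd`, and `sref` stand in for their Fortran counterparts, and everything else is invented for illustration. The point is that a per-processor count gates calculations that touch every owned landunit, so the same landunit can end up with different values depending on which other landunits share its processor:

```python
# Toy model (hypothetical, not CLM code) of how a per-processor count like
# num_solar makes results depend on the decomposition.

def urban_albedo(landunits, coszen):
    """Compute albedo for the landunits owned by one processor."""
    albgrd = {l: 0.0 for l in landunits}   # initialized to 0
    sref = {l: 1.0 for l in landunits}     # initialized to 1
    # Count of *this processor's* landunits with the sun above the horizon.
    num_solar = sum(1 for l in landunits if coszen[l] > 0)
    if num_solar > 0:
        # Calculations run for ALL owned landunits, even those with
        # coszen = 0, so dark landunits pick up albgrd = sref = 1 here.
        for l in landunits:
            albgrd[l] = sref[l]
    return albgrd

coszen = {0: 0.5, 1: 0.0}

# One processor owns both landunits: num_solar = 1, so landunit 1 (dark)
# also gets albedo 1.
together = urban_albedo([0, 1], coszen)

# Each landunit on its own processor: the processor owning landunit 1 has
# num_solar = 0, so landunit 1 keeps albedo 0 -- a restart-file difference.
split = {**urban_albedo([0], coszen), **urban_albedo([1], coszen)}

assert together[1] != split[1]
```

The sunlit landunit gets the same answer either way; only the dark landunit on a `num_solar > 0` processor differs, matching the albgrd/albgri pattern described above.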

@ekluzek ekluzek added this to the clm5 milestone Dec 16, 2017
ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-03-31 10:24:42 -0600

Also indices seem to change...

Another bug in the CLM restart files came up in my discussion with Mariana, and she asked that I pass this along to you. It seems that the indices, like pfts1d_li, etc., in the clm restart files depend on the decomposition, and are possibly incorrect.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 10:44:04 -0600

I got a confirmation of this problem for a test I ran on the crop branch where I see the following differences...

034 erL83 TER.sh _nrsc_do clm_std^nl_urb 20020115:3600 5x5_amazon navy -5+-5 arb_ic .............FAIL! rc= 13

Everything's bit-for-bit up to...

CLM_compare.sh: comparing clmrun.clm2.h1.2002-01-20-00000.nc
with /ptmp/erik/test-driver.888958/TSM._nrsc_do.clm_std^nl_urb.20020115:3600.5x5_amazon.navy.-10.arb_ic/clmrun.clm2.h1.2002
-01-20-00000.nc
CLM_compare.sh: files are NOT b4b

RMS land1d_g 7
RMS cols1d_g 7
RMS cols1d_l 8
RMS pfts1d_g 7
RMS pfts1d_l 8
RMS pfts1d_c 12

and the same pattern to the last h1 file on 2002-01-25.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 10:50:22 -0600

In talking about this with Mariana, she thought we should probably comment out
the writing of the indices -- as they are just wrong. I think they've been
wrong for a long while as well. She couldn't use them in interpinic in her
recent work and I remember having the same problem working on it a long time
ago as well.

She also wondered if the albedo difference would be a problem if you used a
restart file with a different start date (so that you might have points where
the sun is in a different position).

She did agree that we don't want to compare restart files in practice -- we
want to maintain the integrity of the history files. But, she does want to
pursue this issue a bit further to make sure we understand it.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 11:17:02 -0600

I'm seeing the same problem for the following tests on the crop branch on bluefire...

002 erA91 TER.sh _sc_dh clm_std^nl_urb 20030101:3600 4x5 gx3v7 -3+-3 arb_ic .....................FAIL! rc= 13
051 erH52 TER.sh 17p_cnsc_dm clm_std^nl_urb 20020115:1800 10x15 USGS@2000 10+38 cold ............FAIL! rc= 13
072 erJ61 TER.sh 4p_casasc_dh clm_std^nl_urb 20021230:1800 1.9x2.5 gx1v6 10+38 cold .............FAIL! rc= 13

the differences are all in the indices again. This must be something that shows up only in a very recent version of clm; otherwise the problem would've shown up previously.

The documentation in the ChangeLog shows these tests passing even back to clm4_0_22. I'm not sure what to think here. I'll rerun one of these for clm4_0_26 and clm4_0_25 and see what I get...

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 11:40:54 -0600

OK, so the erL83 test PASSes on clm4_0_24 and clm4_0_25 -- but fails on clm4_0_26.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 12:51:12 -0600

The code differences between 25 and 26 were trivial, and didn't explain the problem. It turns out what does explain it is the version of cprnc used. In 26 the newer version is used, and apparently it compares the indices, while the earlier one didn't. When I run the same test for older versions of clm (clm4_0_25 and even clm4_0_14) using the newer cprnc (updating CLM_compare.sh and test_driver.sh), I get fails on the tests.

So this is a long-standing problem that just shows up due to the newer version of cprnc. When I was figuring out the issues with the newer cprnc in the clm4_0_26 tag, I must not have rerun the tests that passed before.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-04-01 13:55:27 -0600

OK, so I talked about this with Keith and Sam. First of all, in principle the indices are important, and 1D vector mode is important. However, Mariana worked with Sam, showed him indices that were wrong, and verified he wasn't using them. interpinic also doesn't use them, as they are incorrect. You can derive them from other indices that do exist.

I also verified that with the new cprnc, I get my test to PASS when I don't write out the fields: land1d_gi, cols1d_li, cols1d_gi, pfts1d_gi, pfts1d_li, and pfts1d_ci in histFileMod.F90 and subgridRestMod.F90. So for now that's what I'll do. But, at some point we may want to fix this in the future. I'll continue to do work on it.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2011-05-02 14:35:42 -0600

In clm4_0_27 I commented out writing the fields that are wrong. But we should still figure out why they are wrong.

ekluzek (Collaborator) commented Dec 16, 2017

Bill Sacks < sacks > - 2013-08-31 06:35:23 -0600

I noticed that

    int pfts1d_ci(pft) ;

is still in the restart file; should this be removed like the other indices were?

ekluzek (Collaborator) commented Dec 16, 2017

Bill Sacks < sacks > - 2013-08-31 06:44:25 -0600

Regarding this: Also indices seem to change... (land1d_g, cols1d_g, etc.):

I was thinking about this issue with respect to the 1-d vector reordering that was done recently in the clm4.5 code (in clm4_5_21, I believe). I believe (but may be wrong) that the indices in memory do not correspond to indices on the file. I'm not sure this explains why these depended on the decomposition before, but I think they will almost certainly depend on the decomposition now.

The issue, I think, is that the arrangement in memory (during a run) does depend on the decomposition - both on the number of processors and on the number of threads (or more precisely, the number of clumps per processor). Mariana went to some pains to make the arrangement of points on the restart file independent of decomposition. However, I think that land1d_g, cols1d_g, etc. refer to the indices in MEMORY, not the indices on the file.

If these variables are desired on the restart file - and on 1-d history files (with dov2xy = .false.) - a possible workaround would be to also write variables like the following: land1d_l, cols1d_c, etc. These would give the index of each point in memory. Note that we do NOT expect cols1d_c(1) to be 1 - again, because the arrangement on the file differs from the arrangement in memory. Furthermore, these variables - like the other indices - would differ depending on the decomposition. But at least it would let you do analyses using these variables.
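The memory-vs-file distinction can be made concrete with a small Python toy (the round-robin layout and all names here are invented for illustration; the real decomposition is more involved): the global file order of points is fixed, but each decomposition concatenates different local blocks in memory, so an index recorded relative to memory changes with the processor count.

```python
# Toy model of the memory-vs-file index mismatch. The file order of points
# is fixed; the in-memory order depends on how points are dealt out to
# processors (here: hypothetical round-robin, then concatenate local blocks).

points = ["a", "b", "c", "d"]  # global (file) order, fixed

def memory_order(points, nprocs):
    """Concatenate each processor's local block of round-robin-assigned
    points -- a stand-in for a real domain decomposition."""
    blocks = [points[p::nprocs] for p in range(nprocs)]
    return [pt for block in blocks for pt in block]

# The in-memory index of point "c" under two decompositions:
mem2 = memory_order(points, 2)  # ['a', 'c', 'b', 'd']
mem4 = memory_order(points, 4)  # ['a', 'b', 'c', 'd']
assert mem2.index("c") != mem4.index("c")

# The file order never changes, so an index expressed in terms of the
# file order would be decomposition-independent.
assert points.index("c") == 2
```

An index variable that records `mem2.index(...)`-style positions is exactly the kind of decomposition-dependent quantity described above.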

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2013-08-31 11:08:02 -0600

Something we gave up with the move from CLM stand-alone testing to CESM testing was testing a change in the number of processors on restart. We assume that the model is able to do this and start up with any number of processors, regardless of how many processors the restart file was written with.

I think this underscores the importance of that ability, and of adding testing for it to CESM testing (at least for components like CLM, where this is expected to work).

It would be good to hear if the current model gives the same answers if you restart with a different number of processors.

ekluzek (Collaborator) commented Dec 16, 2017

Bill Sacks < sacks > - 2013-08-31 15:37:42 -0600

Regarding Erik's last comment: I think we should be okay if we have a PEM (and once threading is fixed, a PET) test that starts up from initial conditions. I haven't checked if the current tests do so.

ekluzek (Collaborator) commented Dec 16, 2017

Erik Kluzek < erik > - 2017-03-21 10:27:00 -0600

Sean Swenson has this to say...

hi, erik,
I think the solution to this problem in histfileMod:

   ! --- EBK Do NOT write out indices that are incorrect 4/1/2011 --- Bug 1310

can be fixed using the code from subgridRestMod:

    do p=bounds%begp,bounds%endp
       iparr(p) = GetGlobalIndex(decomp_index=patch%column(p), clmlevel=namec)
    enddo
    call restartvar(ncid=ncid, flag=flag, varname='pfts1d_column_index', xtype=ncd_int,   &
         dim1name='pft',                                                                  &
         long_name='column index of corresponding pft',                                   &
         interpinic_flag='skip', readvar=readvar, data=iparr)

It seems to work for me offline when I use the pfts1d_column_index from the restart file instead of the pfts1d_ci from the history file. I haven't actually implemented code in histFileMod.
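The idea behind Swenson's fix can be sketched as follows (a Python toy with invented numbers and names; the real `GetGlobalIndex` maps a decomposition-local index into the model's fixed global ordering): write the global index rather than the local one, so the written value is the same under any decomposition.

```python
# Toy model of writing decomposition-independent indices. In the real code,
# GetGlobalIndex maps a local (per-processor) index into the fixed global
# ordering; here that is just "this processor's offset + local index".

def get_global_index(local_index, proc_offset):
    """Stand-in for GetGlobalIndex under a simple block decomposition."""
    return proc_offset + local_index

# The same column under two hypothetical decompositions of 6 columns:
#   layout A: proc0 owns columns 0-2 (offset 0), proc1 owns 3-5 (offset 3)
#   layout B: proc0 owns columns 0-3 (offset 0), proc1 owns 4-5 (offset 4)
# Global column 4 is local index 1 on layout A's proc1, and local index 0
# on layout B's proc1 -- but the written global index agrees.
idx_a = get_global_index(1, 3)
idx_b = get_global_index(0, 4)
assert idx_a == idx_b == 4
```

Writing the raw local index (1 vs. 0 here) is what produced the decomposition-dependent fields in the first place.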

@ekluzek ekluzek self-assigned this Jul 19, 2018
@ekluzek ekluzek added enhancement new capability or improved behavior of existing capability priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations labels Jul 19, 2018
@ekluzek ekluzek modified the milestones: clm5, cmip6 Jul 19, 2018
ekluzek (Collaborator) commented Jul 19, 2018

@swensosc showed me his fix to this problem in the hillslope_hydrology branch. The bit of code in histFileMod is like this

       do p=bounds%begp,bounds%endp
          iparr(p) = GetGlobalIndex(decomp_index=patch%column(p), clmlevel=namec)
       enddo
       call ncd_io(varname='pfts1d_ci'       , data=iparr  , dim1name=namep, ncid=ncid, flag='write')

As far as he can tell it's doing the right thing. There's a possibility he hasn't done enough checking, but it's not demonstrably wrong like we saw before.

There's a different transformation that was attempted before that apparently is incorrect.

We'd like to fix this right away, as not having this output is causing problems: you have to assume an ordering, and assume you picked the correct one.

@ekluzek ekluzek modified the milestones: cmip6, cesm2.1.1 Nov 5, 2018
billsacks added a commit that referenced this issue Dec 3, 2018
New options for irrigation and crop fsat

Introduce three new options:

(1) Ability to withdraw irrigation water from groundwater if not enough
    water is available from rivers. This is controlled via new namelist
    flag, use_groundwater_irrigation. Water can be withdrawn from both
    the unconfined (from the soil column) and confined (from wa)
    aquifers.

(2) Irrigation method: sprinkler (above canopy) vs. drip (below
    canopy). This can be set on a per-crop and per-gridcell basis on the
    surface dataset, but out-of-the-box support for creating the
    necessary surface dataset field is not yet in place (see
    #565). For now, it can be controlled globally via a new
    namelist flag, irrig_method_default. The default is drip, which was
    what we were previously using implicitly.

(3) Set crop fsat to zero. This is controlled by a new namelist option,
    crop_fsat_equals_zero.

Default behavior is the same as before for all three options.

Also:

- If use_aquifer_layer is false (which is the default for CLM50), no
  longer reset wa_col every time step

- Adds indices to vector history files giving column, landunit and
  gridcell indices for each patch, etc. (Resolves #81:
  "Restart files are different for CLM when run over different number of
  tasks" (issue name is a misnomer of the remaining to-dos in that
  issue.)

- Writes 3d time-constant fields on first history file for all tapes,
  not just the first

- Small performance improvements to irrigation
billsacks (Member) commented:

@ekluzek I see you put milestone cesm2.1.1 on this. Note that it was fixed in ctsm1.0.dev020, but that tag won't come to the release branch, so if you want it on the release branch, the relevant changes in histFileMod will need to be ported over.

slevis-lmwg added commits to slevis-lmwg/ctsm that referenced this issue Dec 6, 2018 and Dec 24, 2018
New options for irrigation and crop fsat (same commit message as above)
samsrabin pushed a commit to samsrabin/CTSM that referenced this issue Apr 19, 2024
Set the default DATM_YR_* variables for 1PT streams to 2018 to 2019. For the NEON streams, use the $DATM_YR_* settings rather than hardcoded values. This allows the case settings of the DATM_YR_* variables to change the datm.streams.xml file.

CMEPS Issues Fixed (include github issue #): 
  Fixes ESCOMP#78

Are there dependencies on other component PRs: Yes CTSM
 ESCOMP#1278

Are changes expected to change answers? No
 - [x] bit for bit

Any User Interface Changes (namelist or namelist defaults changes)? No
 - [x] No

Testing performed: 
  Ran SMS_D_Vnuopc.CLM_USRDAT.I1PtClm51Bgc.cheyenne_intel.clm-NEON_NIWO
  which worked and PASSes.

Hashes used for testing:
- [ ] CIME: cime5.6.43
- [ ] CMEPS: v0.10.0
- [ ] CTSM: ctsm5.1.dev036-31-geb5f2e309 (NEON-compset PR ESCOMP#1278)