New CN matrix fails with single point sites with the new ctsm5.3 datasets. #2780

Open
3 tasks done
ekluzek opened this issue Sep 23, 2024 · 15 comments · May be fixed by #2840
Labels
bug something is working incorrectly priority: low Background task that doesn't need to be done right away. science Enhancement to or bug impacting science

Comments

@ekluzek
Collaborator

ekluzek commented Sep 23, 2024

Between ctsm5.2.dev175 and ctsm5.3.0 we've been running a test with MIMICS and the above-ground CN matrix that had been passing. The test is SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn. This has the soil CN matrix off (because MIMICS is non-linear), but the above-ground CN matrix on (use_soil_matrixcn = .false. use_matrixcn = .true.).
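
As a rough sketch, the two settings named above would look like this if written into a case's user_nl_clm. This assumes they are set by hand there; in the test itself they presumably come from the clm-mimics_matrixcn testmod, whose contents are not shown in this thread.

```fortran
! user_nl_clm (sketch only; values taken from the description above)
use_matrixcn      = .true.    ! above-ground (vegetation) CN matrix on
use_soil_matrixcn = .false.   ! soil CN matrix off, since MIMICS is non-linear
```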

There are two reasons for doing this test:

  1. Hopefully get MIMICS to spin up faster with the above-ground matrix on
  2. More extensive testing of the matrix for an edge case where it might fail more easily

The hope for "1" was especially there as we weren't finding methods to speed up the spinup of MIMICS. The test did pass for 30 tags, and just started failing in ctsm5.3.0 with the following type of error in the log files:

lnd.log:

 hist_htapes_wrapup : Closing local history file ./SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn.20240923_125029_ialh14.clm2.h1.0001-01-01-28800.nc at nstep =           16

(shr_strdata_readstrm) reading file ub: /glade/campaign/cesm/cesmdata/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_c160825.nc       7
 ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246

cesm.log:

dec0996.hsn.de.hpc.ucar.edu 0:  ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246
dec0996.hsn.de.hpc.ucar.edu 0: #0  0x12c3b50 in __shr_abort_mod_MOD_shr_abort_backtrace
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:104
dec0996.hsn.de.hpc.ucar.edu 0: #1  0x12c3c13 in __shr_abort_mod_MOD_shr_abort_abort
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:61
dec0996.hsn.de.hpc.ucar.edu 0: #2  0x131f9c8 in __shr_assert_mod_MOD_shr_assert
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_assert_mod.F90.in:95
dec0996.hsn.de.hpc.ucar.edu 0: #3  0xe38814 in __sparsematrixmultiplymod_MOD_spmp_abc
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90:1246
dec0996.hsn.de.hpc.ucar.edu 0: #4  0x8e97db in __cnvegmatrixmod_MOD_cnvegmatrix
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegMatrixMod.F90:1509
dec0996.hsn.de.hpc.ucar.edu 0: #5  0x10466ef in __cndrivermod_MOD_cndriverleaching
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNDriverMod.F90:1098
dec0996.hsn.de.hpc.ucar.edu 0: #6  0x92a6b2 in __cnvegetationfacade_MOD_ecosystemdynamicspostdrainage
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegetationFacade.F90:1125
dec0996.hsn.de.hpc.ucar.edu 0: #7  0x5d7ed6 in __clm_driver_MOD_clm_drv
dec0996.hsn.de.hpc.ucar.edu 0: 	at /glade/work/erik/ctsm_worktrees/answer_changes/src/main/clm_driver.F90:1119

The line it fails on from above is the SHR_ASSERT_FL in this section of code in SparseMatrixMultiplyMod.F90:

    if(present(num_actunit_C))then
       if(num_actunit_C < 0)then
          write(iulog,*) "error: num_actunit_C cannot be less than 0"
          call endrun( subname//" ERROR: bad value for num_actunit_C" )
          return
       end if
       if(.not. present(filter_actunit_C))then
          write(iulog,*) "error: num_actunit_C is presented but filter_actunit_C is missing"
          call endrun( subname//" ERROR: missing required optional arguments" )
          return
       end if
       SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
    end if

The call in CNVegMatrixMod.F90 is here:

         if(num_actfirep .eq. 0 .and. nthreads < 2)then
            call AKallvegc%SPMP_AB(num_soilp,filter_soilp,AKphvegc,AKgmvegc,list_ready_phgmc,list_A=list_phc_phgm,list_B=list_gmc_phgm,&
                 NE_AB=NE_AKallvegc,RI_AB=RI_AKallvegc,CI_AB=CI_AKallvegc)
         else
            call AKallvegc%SPMP_ABC(num_soilp,filter_soilp,AKphvegc,AKgmvegc,AKfivegc,list_ready_phgmfic,list_A=list_phc_phgmfi,&
                 list_B=list_gmc_phgmfi,list_C=list_fic_phgmfi,NE_ABC=NE_AKallvegc,RI_ABC=RI_AKallvegc,CI_ABC=CI_AKallvegc,&
                 use_actunit_list_C=.True.,num_actunit_C=num_actfirep,filter_actunit_C=filter_actfirep)
         end if

Definition of done:

  • FAIL: Test if works for cold start
  • NO: Assess if should add a short f10 test and make sure it works
  • Change accordingly to what is found out from above
@ekluzek ekluzek added closed: wontfix We won't fix this issue, because it would be too difficult and/or isn't important enough to fix priority: low Background task that doesn't need to be done right away. bug something is working incorrectly science Enhancement to or bug impacting science labels Sep 23, 2024
@ekluzek
Collaborator Author

ekluzek commented Sep 23, 2024

This is the only test we have for mimics_matrixcn. It's also possible that the tests that passed would fail if run out far enough.

@ekluzek
Collaborator Author

ekluzek commented Sep 23, 2024

Here's the note about this test when it was added.

#640 (comment)

I'm also doing some longer and different tests in ctsm5.2.028 to see whether the test just happened to pass because it was too short, as well as making sure the same test works without MIMICS.

@ekluzek
Collaborator Author

ekluzek commented Sep 23, 2024

Longer tests and tests at f10 in ctsm5.2.028 seem to be fine.

SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D_Lm1.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
SMS_Ly2.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn

So maybe there is something specific about this with ctsm5.3.0 datasets.

We'll mark this as an expected fail for now though.

@ekluzek
Collaborator Author

ekluzek commented Sep 24, 2024

The other test that fails in the same way is:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn

@slevis-lmwg
Contributor

...and SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV--clm-matrixcnOn

@ekluzek ekluzek changed the title from "Limitation: MIMICS with above ground CN matrix" to "New CN matrix fails with single point sites with the new ctsm5.3 datasets." Sep 25, 2024
@ekluzek ekluzek added next this should get some attention in the next week or two. Normally each Thursday SE meeting. and removed priority: low Background task that doesn't need to be done right away. labels Sep 25, 2024
@slevis-lmwg
Contributor

slevis-lmwg commented Sep 25, 2024

My gut feeling is that these tests need new finidat files, based on past experiences where CNmatrix has crashed with one finidat and not with another (#2592).

E.g. the nearest neighbor from the finidat may not contain the right pft combinations needed for these single-point simulations.

@slevis-lmwg
Contributor

slevis-lmwg commented Sep 26, 2024

In one of the failing tests, I changed finidat from
ctsm52026_f09_pSASU.clm2.r.0421-01-01-00000.nc
to
clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc
and the test failed in a different timestep.

Next I want to try setting finidat to the interpolated file saved in
.../tests_0923-141750de/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn.GC.0923-141750de_gnu/run/init_generated_files/
Hmm, but that may do nothing to help. I may need to generate a new finidat for this point starting from a cold start simulation.

@samsrabin samsrabin added this to the cesm3_0_beta04 milestone Sep 26, 2024
@slevis-lmwg slevis-lmwg self-assigned this Sep 26, 2024
@ekluzek ekluzek removed the closed: wontfix We won't fix this issue, because it would be too difficult and/or isn't important enough to fix label Sep 27, 2024
@ekluzek
Collaborator Author

ekluzek commented Sep 27, 2024

A broader question we wonder here (@slevis-lmwg and I) for the group to assess: (discussed at CTSM SE Oct/10th/2024)

  • Should we provide IC files for single point sites? No, except NEON
  • Just some (like NEON), just the ones we test for, or all? All NEON would be good. Currently the process creates them outside of tags, though, and that may be fine for now.
  • When we run matrix and run into problems like this -- do we fix it with updated IC files as a practice? Only for global grids. For single point, just change to a cold-start.

@ekluzek ekluzek added the priority: low Background task that doesn't need to be done right away. label Sep 27, 2024
@wwieder wwieder removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Oct 10, 2024
@wwieder
Contributor

wwieder commented Oct 10, 2024

maybe matrix tests always need to start from a cold start? if you're running matrix, then by definition you're doing a spinup.

@ekluzek
Collaborator Author

ekluzek commented Oct 10, 2024

I updated the questions above, based on this morning's discussion.

@slevis-lmwg
Contributor

slevis-lmwg commented Oct 15, 2024

Troubleshooting suggests that my gut feeling was wrong.

SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn started cold all along and it failed regardless, so I tried the following:
I turned off matrixcn and ran the case to generate a restart file. Then I turned on matrixcn and set finidat to this restart file. The simulation failed at the same line as before.

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn never started cold. I turned off matrix and generated a restart file. Then I turned on SASU and set finidat to this restart file. The simulation failed at the same line as before.

1x1 matrix tests that pass:

ERS_Lm54_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly--clm-matrixcnOn
ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.izumi_intel.clm-cropMonthOutput--clm-matrixcnOn_ignore_warnings
ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-cropMonthlyNoinitial--clm-matrixcnOn.GC.1014-115134iz_int
  • A common element among the tests that pass is Clm50 and among the tests that fail Clm60 BUT our global Clm60 tests pass, so this observation may be irrelevant.
  • Another difference: the two failing tests use DEBUG while the passing tests do not. Again though our global matrix tests pass regardless.

@slevis-lmwg
Contributor

Trying a Clm6 version and Clm6 DEBUG version of the first in the above list of already passing tests:

PASS ERS_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
PASS ERS_D_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup

and non-DEBUG versions of the failing tests:

PASS SMS_Ld10_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
PASS SMS.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn

So DEBUG must be uncovering a problem in these two. I will think about what I want to try next...

@slevis-lmwg
Contributor

I added diagnostic write-statements just before the error gets triggered in SparseMatrixMultiplyMod.F90 line 1246:
SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
and both failing tests fail when they encounter
size(filter_actunit_C) = num_actunit_C
This seems like a non-dealbreaker to me, so I changed the ASSERT to ">="
The equality gets the currently failing tests to pass without triggering other problems.
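
A minimal standalone sketch of the boundary case, not CTSM code: the program name, the variable names, and the one-element array are hypothetical, chosen only so that the filter array's size equals the active count, which is the condition the diagnostics reported.

```fortran
program check_filter_bounds
  implicit none
  ! Hypothetical stand-ins for filter_actunit_C and num_actunit_C
  integer, parameter :: nfilter = 1          ! assumed array size for illustration
  integer :: filter_actunit(nfilter)
  integer :: num_actunit

  filter_actunit = [1]
  num_actunit = 1                            ! every entry is active: count equals size

  ! Original assertion condition: rejects the case where the array is exactly full
  print *, "strict check (>)   passes: ", size(filter_actunit) >  num_actunit
  ! Relaxed condition: only rejects an array that is too small to hold the count
  print *, "relaxed check (>=) passes: ", size(filter_actunit) >= num_actunit
end program check_filter_bounds
```

With these values the strict check reports F and the relaxed check reports T. The relaxed form still catches an array that is too small to hold num_actunit entries.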

@slevis-lmwg
Contributor

slevis-lmwg commented Oct 16, 2024

@ekluzek I will run this by you before I open a PR with this code change.

My branch is in this directory:
/glade/work/slevis/git/LMWG_dev8
and open the PR with
git push -u slevis-lmwg fix_1x1_matrix_fails

@ekluzek
Collaborator Author

ekluzek commented Oct 18, 2024

@slevis-lmwg that's correct, the inequality should be >= rather than just >. The point of the check is just to make sure the array size isn't too small. The array must have been larger all the time previously. I'd have to think about why that's the case...

I'm glad you were able to figure that out.

@slevis-lmwg slevis-lmwg linked a pull request Oct 18, 2024 that will close this issue
Projects
Status: In Progress

4 participants