
@ekluzek
Collaborator

@ekluzek ekluzek commented May 10, 2025

Description of changes

Update submodules to cesm3_0_beta06

This starts from #3111, which brings in the answer changes for derecho_intel by updating the compiler to use the intel-oneapi backend.

Contributors other than yourself, if any:

CTSM Issues Fixed (include github issue #):
Fixes #2710
Fixes #2476
Fixes #3135
Fixes #3108
Addresses some items in #3156

Are answers expected to change (and if so in what way)? Yes
derecho_intel and derecho_nvhpc

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No No

Testing performed, if any: Running regular testing ctsm_sci and fates test lists

@ekluzek ekluzek added this to the cesm3_0_beta07 milestone May 10, 2025
@ekluzek ekluzek self-assigned this May 10, 2025
@ekluzek ekluzek added the enhancement, priority: high, and non-bfb labels May 10, 2025
@github-project-automation github-project-automation bot moved this to Ready to start (or start again) in CTSM: Upcoming tags May 10, 2025
@ekluzek ekluzek moved this from Ready to start (or start again) to In progress - master in CTSM: Upcoming tags May 10, 2025
@ekluzek
Collaborator Author

ekluzek commented May 10, 2025

aux_clm testing on Izumi is as expected. No differences to baseline and all expected tests pass.

On Derecho, however, the following are unexpected:

Three tests differ from the baseline:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop		
SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen		
SMS_Ld5.f10_f10_mg37.ISSP245Clm50BgcCrop.derecho_gnu.clm-ciso_dec2050Start	

The following 22 tests are listed as pending. They were submitted but did not seem to execute, which is odd:

ERS_D_Ld5_Mmpi-serial.1x1_mexicocityMEX.I1PtClm60SpRs.derecho_gnu.clm-CLM1PTStartDate		
ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropRs.derecho_intel.clm-decStart1851_noinitial		
ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold		
ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesRs.derecho_intel.clm-FatesCold		
ERS_Ld1640_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup		
ERS_Ld600_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.derecho_gnu.clm-cropMonthlyNoinitial		
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.derecho_gnu.clm-ciso_monthly		
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.derecho_gnu.clm-ciso_monthly--clm-matrixcnOn		
SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn		
SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdDryDepSatPhen		
SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdMeganSatPhen		
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60Bgc.derecho_gnu.clm-HillslopeC		
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-NEON-MOAB--clm-PRISM		EXPECTED (SHAREDLIB_BUILD RUN)
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV		EXPECTED (SHAREDLIB_BUILD)
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn		
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL		EXPECTED (SHAREDLIB_BUILD RUN)
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO		EXPECTED (SHAREDLIB_BUILD RUN)
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60SpRs.derecho_intel.clm-default--clm-NEON-TOOL		
SMS_Ld12_Mmpi-serial.1x1_vancouverCAN.I1PtClm60SpRs.derecho_gnu.clm-output_sp_highfreq		
SMS_Ld5_Mmpi-serial.1x1_brazil.IHistClm60Bgc.derecho_gnu.clm-mimics		
SMS_Ly1_Mmpi-serial.1x1_brazil.IHistClm60BgcQianRs.derecho_intel.clm-output_bgc_highfreq		
SMS_Ly5_Mmpi-serial.1x1_smallvilleIA.IHistClm60BgcCropQianRs.derecho_gnu.clm-gregorian_cropMonthOutput

The following three failed unexpectedly:

FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel	(NLCOMP RUN)		
MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel	(SHAREDLIB_BUILD NLCOMP)		
SMS_D_Ld5.f09_g17.ISSP126Clm50BgcCrop.derecho_intel.clm-datm_ssp126_anom_forc	(SETUP)

For the tests with differences:

The difference in the first two nvhpc tests is probably because the nvhpc build now uses cray-libsci/24.03.0, which it didn't use before.

The ISSP245Clm50BgcCrop compset changes answers because of the update to the CDEPS tag where ISSP cases turn on anomaly forcing out of the box.

So answer changes are all expected.

For the pending tests

I resubmitted one and it behaved the same way, returning quickly. So I'll need to look into this further.

The three fails

FUNIT and MKSURF fail with:

copying /glade/work/erik/ctsm_worktrees/external_updates/ccs_config/machines/cmake_macros/../derecho/derecho.cmake to /glade/derecho/scratch/erik/tests_ctsm5341cesm3b6acl/FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel.GC.ctsm5341cesm3b6acl_int/bld/cmake_macros
copying /glade/work/erik/ctsm_worktrees/external_updates/ccs_config/machines/cmake_macros/../derecho/gnu_derecho.cmake to /glade/derecho/scratch/erik/tests_ctsm5341cesm3b6acl/FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel.GC.ctsm5341cesm3b6acl_int/bld/cmake_macros
ERROR: FakeCase does not support getting value of 'GPU_TYPE'

So the latest submodules probably need some simple adjustments to recognize GPU_TYPE, which shouldn't be hard to fix.

The datm_ssp126_anom_forc testmod probably fails because it needs #2686.

@ekluzek
Collaborator Author

ekluzek commented May 10, 2025

The pending tests do seem to legitimately fail early. They give little sign of where the run is dying and no traceback. I turned on PET files and raised the ESMF verbosity level to the max, but that still gave nothing helpful. It does look like the failure is somewhere in the land initialization.

These cases fail with both the gnu and intel compilers, and with DEBUG on and off. The commonality is that they are all mpi-serial. But I didn't see a difference in the build between the previous working version (#3111) and this one.
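To make that kind of comparison less error-prone, the test names can be split into their fields and tallied. The following is only a minimal sketch, assuming the usual CIME layout `TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`; `ParsedTest`, `parse_test_name`, `is_debug`, and `mpilib_modifier` are hypothetical helper names, not part of cime.

```python
from collections import Counter
from typing import NamedTuple, Optional, Tuple


class ParsedTest(NamedTuple):
    test_type: str
    modifiers: Tuple[str, ...]
    grid: str
    compset: str
    machine: str
    compiler: str
    testmods: Optional[str]


def parse_test_name(name: str) -> ParsedTest:
    """Split TESTTYPE[_MODS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS] into fields."""
    parts = name.strip().split(".")
    if len(parts) not in (4, 5):
        raise ValueError(f"unexpected test name layout: {name!r}")
    head, grid, compset, mach_comp = parts[:4]
    testmods = parts[4] if len(parts) == 5 else None
    test_type, *modifiers = head.split("_")
    # Assumption: the compiler is whatever follows the last underscore.
    machine, _, compiler = mach_comp.rpartition("_")
    return ParsedTest(test_type, tuple(modifiers), grid, compset,
                      machine, compiler, testmods)


def is_debug(test: ParsedTest) -> bool:
    """True when the name carries the _D (DEBUG) modifier."""
    return "D" in test.modifiers


def mpilib_modifier(test: ParsedTest) -> Optional[str]:
    """Return the MPI library named by an M<lib> modifier, if one is present."""
    for mod in test.modifiers:
        if mod.startswith("M"):
            return mod[1:]
    return None


# Three of the pending tests listed earlier in this thread.
PENDING = [
    "ERS_D_Ld5_Mmpi-serial.1x1_mexicocityMEX.I1PtClm60SpRs.derecho_gnu.clm-CLM1PTStartDate",
    "ERS_Ld1640_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup",
    "SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn",
]

summary = Counter(
    (t.compiler, is_debug(t), mpilib_modifier(t) or "(not in name)")
    for t in map(parse_test_name, PENDING)
)
for (compiler, debug, lib), count in sorted(summary.items()):
    print(f"{compiler:6s} debug={debug!s:5s} mpilib={lib:14s} n={count}")
```

A name without an `M` modifier only means the MPI library is not spelled out in the test name; which library such a case actually builds with has to be checked in the case itself.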

So the next thing I will try is an incremental update of the cesm alpha tag submodules, testing alpha06e and alpha06f to see where this behavior starts. This may mean that the problem is in either CMEPS or CDEPS.

So tests to try:

| alpha tag | PASS | Notes |
|---|---|---|
| alpha06d | PASS | #3111 |
| alpha06e | PASS | |
| alpha06e+cime1.0.87+mpi-serial2.5.4 | PASS | |
| alpha06e+previous + ccs-1.0.40 | PASS | This reduces it to one submodule of cime between 6.1.87 and 6.1.93 |
| alpha06e+previous + next cime commit | X | Using git-bisect I got to the commit that fails |
| alpha06e+build-g | X | |
| alpha06f | X | |
| alpha06g | X | |

The problem occurs between alpha06e and alpha06f. The difference in submodules is:

-fxtag = ccs_config_cesm1.0.32
+fxtag = ccs_config_cesm1.0.40

-fxtag = cime6.1.72
+fxtag = cime6.1.93

-fxtag = cmeps1.0.42
+fxtag = cmeps1.0.47

-fxtag = cdeps1.0.65
+fxtag = cdeps1.0.73

-fxtag = MPIserial_2.5.1
+fxtag = MPIserial_2.5.4

One way to divide it up is to put the build things together: ccs_config, cime, and mpi-serial, and the code things together: cmeps and cdeps.
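The tag comparison and the build/code split can be sketched in a few lines. This is an illustration only: the tag values are the ones quoted in this thread, but the submodule section names, the `make_gitmodules` helper, and the group sets are assumptions made for the example, and the parser handles just the `fxtag` key.

```python
import re

SECTION_RE = re.compile(r'^\[submodule\s+"(?P<name>[^"]+)"\]$')
FXTAG_RE = re.compile(r'^\s*fxtag\s*=\s*(?P<tag>\S+)\s*$')


def read_fxtags(gitmodules_text: str) -> dict:
    """Map each submodule section name to its fxtag value."""
    tags, current = {}, None
    for line in gitmodules_text.splitlines():
        section = SECTION_RE.match(line.strip())
        if section:
            current = section.group("name")
            continue
        tag = FXTAG_RE.match(line)
        if tag and current is not None:
            tags[current] = tag.group("tag")
    return tags


def changed_tags(old: dict, new: dict) -> dict:
    """Return {name: (old_tag, new_tag)} for every submodule whose tag differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


def make_gitmodules(tags: dict) -> str:
    """Hypothetical helper: build a minimal .gitmodules-like text for the example."""
    return "\n".join(f'[submodule "{n}"]\n\tfxtag = {t}' for n, t in tags.items())


# Tags from the alpha06e -> alpha06f difference; section names are assumed.
OLD = {"ccs_config": "ccs_config_cesm1.0.32", "cime": "cime6.1.72",
       "cmeps": "cmeps1.0.42", "cdeps": "cdeps1.0.65",
       "mpi-serial": "MPIserial_2.5.1"}
NEW = {"ccs_config": "ccs_config_cesm1.0.40", "cime": "cime6.1.93",
       "cmeps": "cmeps1.0.47", "cdeps": "cdeps1.0.73",
       "mpi-serial": "MPIserial_2.5.4"}

BUILD_GROUP = {"ccs_config", "cime", "mpi-serial"}  # build infrastructure
CODE_GROUP = {"cmeps", "cdeps"}                     # model code

diff = changed_tags(read_fxtags(make_gitmodules(OLD)),
                    read_fxtags(make_gitmodules(NEW)))
for name, (old_tag, new_tag) in diff.items():
    group = "build" if name in BUILD_GROUP else "code"
    print(f"{group:5s} {name:11s} {old_tag} -> {new_tag}")
```

Updating one group at a time while holding the other at its old tags halves the search space before any per-commit bisection is needed.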

It failed when leaving the code submodules behind and updating the build submodules, but then passed when cime and ccs_config were backed off a bit. With ccs_config updated it still passes. So logically the problem was between cime6.1.87 and cime6.1.93, and I could use git-bisect to find the cime commit that introduced it. It had to do with how much memory to ask for in the batch system.
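The search that git-bisect performs between a known-good and a known-bad revision is a binary search. The sketch below shows the idea under stated assumptions: the commit list is ordered oldest to newest, the first entry passes, the last entry fails, and the failure persists once it has been introduced. `first_bad` and the placeholder commit names are hypothetical; in practice `is_good` would build and run one of the mpi-serial tests.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest failing commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Placeholder history between the two tags; "c1".."c4" are made-up names.
HISTORY = ["cime6.1.87", "c1", "c2", "c3", "c4", "cime6.1.93"]
tested = []


def fake_test(commit: str) -> bool:
    """Stand-in for a real build-and-run check; pretends c3 introduced the bug."""
    tested.append(commit)
    return HISTORY.index(commit) < HISTORY.index("c3")


culprit = first_bad(HISTORY, fake_test)
print(f"first bad commit: {culprit} (after testing {len(tested)} commits)")
```

Because each step halves the remaining range, only a handful of builds are needed even when many commits separate the two tags.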

ekluzek added 5 commits May 10, 2025 17:35
The mpi-serial case fails here.
This PASSes for mpi-serial
This was something that was in a CESM commit to .gitmodules
This passes. Which shows the problem is between cime6.1.87 and
cime6.1.93 so should be able to be solved with git-bisect.
Now, with a cime branch to fix the mpi-serial issue, update submodules
back up to cesm3_0_beta06 versions. I ran a list of tests that worked,
but now will run aux_clm again as well as ctsm_sci.
@ekluzek ekluzek changed the title Update submodules to cesm3_0_beta06 ctsdm5.3.04X: Update submodules to cesm3_0_beta06 May 12, 2025
@ekluzek ekluzek changed the title ctsdm5.3.04X: Update submodules to cesm3_0_beta06 ctsdm5.3.046: Update submodules to cesm3_0_beta06 May 12, 2025
@samsrabin samsrabin changed the title ctsdm5.3.046: Update submodules to cesm3_0_beta06 ctsm5.3.046: Update submodules to cesm3_0_beta06 May 14, 2025
…m_ssp126_anom_forc test because it no longer works, and the changes in ESCOMP#2686 handle it
ekluzek added 5 commits May 27, 2025 15:17
Remove MKSURFDATAESMF from prealpha testing.
Switch the prealpha plain ne30 test to ctsm_sci
Add a FATES NoComp test to prebeta
Remove two Clm45 tests from prealpha and aux_cime_baselines.
As well as two till tests from prebeta
Replace with Clm60 tests with ciso, izumi_nag, nldas, Fates, and DEBUG off mpi-serial for prebeta and the last for prealpha.
…lp with identifying mpi-serial build issues in the nightly testing with submodules updated
@ekluzek
Collaborator Author

ekluzek commented May 28, 2025

OK, submitted all the testing to Izumi and Derecho: aux_clm, ctsm_sci, and fates. We'll see what that shows tomorrow morning...

@ekluzek
Collaborator Author

ekluzek commented May 28, 2025

I got a few unexpected fails, but I have a fix for a couple of them, and will make one into an issue that we can probably fix on b4b-dev.

FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.izumi_intel	(MODEL_BUILD)
SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdSatPhen		
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO		EXPECTED (SHAREDLIB_BUILD RUN)
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_gnu.clm-FatesCold	(RUN)
PVT_Lm3.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesLUPFT	(RUN)		EXPECTED (RUN)	

The FATES failures seem to be due to not enough memory for partial nodes. There's a simple fix: I'll increase the memory requested per task for FATES cases.

I'll file an issue for the FUNITCTSM problem. It's probably something in the cime update beyond cime6.1.100, where I last tested FUNITCTSM on Izumi.

@ekluzek
Collaborator Author

ekluzek commented May 28, 2025

There are a few baselines on Izumi for FATES tests that I can't compare to because of permissions:

ERS_D_Ld30.f45_f45_mg37.HIST_DATM%CRUv7_CLM50%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP.izumi_nag.clm-FatesColdLandUse (BASELINE)
ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdHydro (BASELINE)
ERS_D_Ld5.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesCold (BASELINE)

But outside of that, answers are as expected: only MEGAN fields change for non-FATES tests, and results are otherwise identical on Izumi.

@ekluzek
Collaborator Author

ekluzek commented May 28, 2025

On Derecho answer changes are as expected:

derecho_intel and derecho_nvhpc change answers as expected
MEGAN fields change as expected
SSP and tests with drydep on change answers as expected

@wwieder wwieder changed the title ctsm5.3.050: Update submodules to cesm3_0_beta06 + MEGAN namelist (answer change) ctsm5.3.051: Update submodules to cesm3_0_beta06 + MEGAN namelist (answer change) May 29, 2025
Fix Linux Podman; prefer Linux Docker; update docs docs

 Conflicts:
	doc/ChangeLog
	doc/ChangeSum
Contributor

@slevis-lmwg slevis-lmwg left a comment


Thank you @ekluzek

@ekluzek ekluzek merged commit 8f7d0c5 into ESCOMP:master May 30, 2025
3 checks passed
@github-project-automation github-project-automation bot moved this from In progress - master to Done (non release/external) in CTSM: Upcoming tags May 30, 2025
@ekluzek ekluzek deleted the update_submodules_to_cesm30_beta06 branch May 30, 2025 06:26
@slevis-lmwg slevis-lmwg linked an issue Jun 5, 2025 that may be closed by this pull request
