
Add mksurfdata_esmf system test to test-suite #1756

Merged

Conversation

slevis-lmwg
Contributor

@slevis-lmwg slevis-lmwg commented May 17, 2022

Description of changes

Confirm that mksurfdata_esmf builds and generates an fsurdat (CTSM surface dataset).
Confirm that the CTSM completes when pointing to the generated fsurdat.

Remove mksurfdata_map test from tools testing (if not done already).

Specific notes

Contributors other than yourself, if any:
@billsacks @ekluzek

CTSM Issues Fixed (include github issue #):
#1717

Are answers expected to change (and if so in what way)?
No, this is a new test.

Any User Interface Changes (namelist or namelist defaults changes)?
New test will be added to test-suite.

Testing performed, if any:
Successfully ran a very early draft of the script using the test
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default
Generating the f10 surface dataset uses a lower-resolution topography raw dataset, thereby bypassing the 1-km topography raw dataset. See the corresponding failures below.

Failures:
MKSURFDATAESMF_P144x1.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default with mpiexec_mpt -np 144. Stops while working with the 1-km topography raw dataset, so I assume it needs more memory per node.
MKSURFDATAESMF_P48x3.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default with mpiexec_mpt -np 48 (144 returned error). Stops while working with the 1-km topography raw dataset.

Collaborator

@ekluzek ekluzek left a comment


This is really great to see. I have a few simple comments for small changes.

The other thing I wonder about is adding a "generic" machine option that would add some comments about how you need to change the script to work on a generic machine. So it would add something like this in front of the # PBS commands...

# Edit the batch directives for your batch system
# Below are the examples for the batch directives on cheyenne:
# Edit the following to work on your batch system

And then for the mpiexec line have it setup to use mpirun, and add a comment like this

# Edit the mpirun command to use the MPI executable on your system and the arguments it requires.

An expansion that we will want for this is to add the ability to also run on izumi. It would be good to know we can easily run on NCAR machines cheyenne and izumi. We could make running on izumi require a custom script as another solution. But, it also doesn't look that hard to extend this to work on izumi as well.

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 18, 2022

And then for the mpiexec line have it setup to use mpirun, and add a comment like this
# Edit the mpirun command to use the MPI executable on your system and the arguments it requires.

@ekluzek
Are you suggesting that I set it up to use mpirun for all machines, including cheyenne and casper? Or set it up with mpirun only for other machines?

@slevis-lmwg
Contributor Author

@billsacks and @ekluzek
I think this work is at a good stopping point, so I'm requesting your reviews.

Member

@billsacks billsacks left a comment


I haven't done a super-careful review of this, but I feel like I have reviewed it enough. It looks great! Thanks a lot for your work on this, Sam. I do have two inline comments, which I don't feel absolutely need to be addressed, but would be good to do – especially the more substantive one about the location of the build directory, though I realize that could involve some significant changes to the current build script.

One other thing that I try to address whenever I'm adding a new test: Did you test some failures at different points (e.g., build failure of mksurfdata_esmf, runtime failure of mksurfdata_esmf, runtime failure of the model) to ensure errors are reported correctly in those failure cases?

cime_config/SystemTests/mksurfdataesmf.py
if not os.path.exists(os.path.join(self._get_caseroot(),
                                   'done_MKSURFDATAESMF_setup.txt')):
    # Paths and strings
    self._rm_bld_dir = f"rm -rf {self._tool_path}/bld"
Member


It's not ideal that this needs to mess with directories in the CTSM sandbox. Some issues I see with that are:

  • Trying to run multiple tests of this test type could cause failures due to race conditions: the tests could stomp on each other's build directory
  • You could have problems if you're running the testing from a place where you also want to actually use the mksurfdata build for some real work at the same time
  • You can't run tests from a directory where you don't have write permission (may not be an issue in practice)

I think the solution to this is to run the cmake and make commands from some other location that is inside the test directory rather than being under the ctsm root. If I remember correctly, the way cmake works is that you run cmake from your desired build location, pointing to the source directory. So you could create a build directory inside the test directory then run cmake /path/to/ctsmroot/tools/mksurfdata_esmf/src. For this to work, the paths in gen_mksurfdata_build would need to be generalized so that it could be run from anywhere. Or maybe there's some other way to accomplish this goal.

I don't feel this absolutely needs to be done immediately, but it should be done at some point.
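The out-of-source build described above can be sketched as follows. This is only an illustration of the cmake pattern, not the actual gen_mksurfdata_build logic; the function name and paths are hypothetical.

```python
import os

def out_of_source_build_cmds(ctsm_root, build_dir, gmake_j=4):
    """Sketch: build mksurfdata from a test-specific directory instead
    of the CTSM sandbox. cmake is run from the build directory,
    pointing back at the source tree. Illustrative only."""
    src_dir = os.path.join(ctsm_root, "tools", "mksurfdata_esmf", "src")
    return [
        f"mkdir -p {build_dir}",
        # run cmake from the desired build location, pointing to source
        f"cd {build_dir} && cmake {src_dir}",
        f"cd {build_dir} && make -j {gmake_j}",
    ]
```

Because nothing here writes into the CTSM checkout, multiple tests can each use their own build directory without stomping on one another.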

Contributor Author


As a temporary solution, the test now deletes the /tools/mksurfdata_esmf/bld directory when it's done needing it. This reduces the likelihood of the above risks without eliminating them.

To resolve this, I would like to discuss it in a meeting (me and @ekluzek ?)

Member


In my view, the temporary solution could make things worse in a number of situations where some other process expects the build to be there and then it gets removed out from under it. So I'd say what you had before this was probably a bit better... but really it's hard to avoid problems in certain situations until you move to a build in a test-specific location.

Contributor Author


Ok, I reverted to what I had before this.

@slevis-lmwg
Contributor Author

Did you test some failures at different points (e.g., build failure of mksurfdata_esmf, runtime failure of mksurfdata_esmf, runtime failure of the model) to ensure errors are reported correctly in those failure cases?

Good point. While developing the test, I had to look at TestStatus.log for useful error messages. So I'm adding guidance along those lines when any of the subprocess commands fails. However, I'm leaving error messages unchanged for runtime failure of the model, with the assumption that such messages are standard across these tests.

@billsacks
Member

Good point. While developing the test, I had to look at TestStatus.log for useful error messages. So I'm adding guidance along those lines when any of the subprocess commands fails. However, I'm leaving error messages unchanged for runtime failure of the model, with the assumption that such messages are standard across these tests.

Sounds good. Needing to look in TestStatus.log is reasonable. The most important thing is that a failure at any stage results in a FAIL result somewhere in the TestStatus file. This should be true both for the initial run of the test and a rerun. Ideally, a failure in the build or run of mksurfdata_esmf would cause the test to abort with a FAIL before it gets into the model run. Similarly, if you run the test once and it passes, and then you make a change to mksurfdata_esmf and rerun the existing test, but now there is a build or runtime failure in mksurfdata_esmf, the test should also report a FAIL result in the appropriate phase. We want to avoid, for example, the possibility that in a rerun, mksurfdata_esmf fails but there was already an output file in place so the test happily continues with the previously-generated output file.
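The behavior described above (a failure in any phase should abort the test with a FAIL, rather than letting a later phase continue with stale output) can be sketched like this. The function and phase names are hypothetical, not the actual CIME/CTSM test code.

```python
import subprocess

def run_phase(cmd, phase_name):
    """Sketch: run one stage of the system test (e.g. the
    mksurfdata_esmf build or run) and fail loudly on error, so a
    later phase never runs against previously-generated output.
    Message wording is illustrative only."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Abort here rather than continuing into the model run
        raise RuntimeError(
            f"{phase_name} FAIL (rc={result.returncode}); "
            "see TestStatus.log for the command output"
        )
    return result.stdout
```

Raising on the first failed phase also covers the rerun case: a fresh build or runtime failure in mksurfdata_esmf surfaces as a FAIL even when an old output file is still sitting in the test directory.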

@billsacks
Member

Thanks @slevisconsulting!

@billsacks
Member

I skimmed back through my earlier comments. Everything is addressed except for the comments on trying to do the build within the test directory if possible.

@slevis-lmwg
Contributor Author

The other thing I wonder about is adding a "generic" machine option that would add some comments about how you need to change the script to work on a generic machine. So it would add something like this in front of the # PBS commands...

@ekluzek per our conversation:
I included the guidance comments that you recommended and will not pursue implementation of a "generic" machine option for now.

Now the new mksurfdata_esmf system test doesn't create /bld in users'
.../tools/mksurfdata_esmf and instead places this build directory where
the test case builds and runs. This avoids potential confusion from a
test manipulating files in users' ctsm directories.
...despite the "git -C" option being unavailable on casper
@slevis-lmwg
Contributor Author

@ekluzek I will commit/push pylint corrections soon. At that point this PR will be ready for final review.

@slevis-lmwg
Contributor Author

Testing performed:
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default PASS
make utest PASS
make stest: two lilac tests fail with "Current machine cheyenne does not match case machine ctsm-build"
...but this is expected for git describe --> alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-26-g8b720987b, as explained in #1739, which is resolved in later versions.

@ekluzek pls let me know if you would like to review this PR in a meeting.

@ekluzek
Collaborator

ekluzek commented May 26, 2022

I was looking this over, and I think the best way to keep this updated and get it to work on izumi will be to use the env_mach_specific.xml file to determine what the mpirun command needs to look like. The mpirun command (from looking at it for a CTSM case) is fairly complex, and it depends on the MPI library being used.

So for mvapich2

mpiexec --machinefile $ENV{PBS_NODEFILE} -n <ntasks> --prepend-rank

and for openmpi

mpiexec -n <ntasks> --tag-output

The mpirun command can be complex and changes with CESM, so using CESM to manage it might be the best way to go. That also provides a general solution that will work for any machine, compiler, and mpilib combination that cime is ported to.
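For illustration only, the per-library launch lines above could be selected like this. The function is hypothetical and the templates are hardcoded purely to show the shape of the problem; the point of the comment is that CIME's env_mach_specific.xml should supply these instead.

```python
def mpi_launch_cmd(mpilib, ntasks, executable):
    """Sketch: pick the MPI launch line based on the MPI library,
    following the two examples above. $PBS_NODEFILE is the shell form
    of the node file variable. Hardcoding this is exactly what the
    env_mach_specific.xml approach would avoid."""
    if mpilib == "mvapich2":
        return (f"mpiexec --machinefile $PBS_NODEFILE -n {ntasks} "
                f"--prepend-rank {executable}")
    if mpilib == "openmpi":
        return f"mpiexec -n {ntasks} --tag-output {executable}"
    raise ValueError(f"no launch template for mpilib {mpilib}")
```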

@ekluzek
Collaborator

ekluzek commented May 26, 2022

@slevisconsulting and I looked this over a bit, tried to get it working, and ran into some roadblocks. One thing that I think will be important to get working properly is the ability to turn DEBUG compiler options on in the build. We think that just requires the env variable DEBUG to be set before the configure line, but I suspect that because this is a CMake build there's more to it than that.

@ekluzek
Collaborator

ekluzek commented May 26, 2022

Oh, I think I put my comments in the wrong PR...

@slevis-lmwg
Contributor Author

@ekluzek pls let me know if you would like to review this PR in a meeting.

@ekluzek I wasn't pushing for wrapping up this PR and #1748 because of large diffs that I was finding in these files generated by mksurfdata_esmf:
surfdata_0.9x1.25_SSP5-8.5_78pfts_CMIP6_1850-2100_c220525.nc
landuse.timeseries_0.9x1.25_SSP5-8.5_78_CMIP6_1850-2100_c220525.nc
The good news is that I have now tracked down the diffs, and they are fully expected. In particular, the large diffs were against release-clm5.0.18, while the diffs are very small when I compare against the files in /glade/work/oleson/ctsm_dynurb_mksurf_ctsm5.2.mksurfdata_I1564/tools/mksurfdata_map, because Keith generated those files with the same code mods that I used from #1586 and #1587.

So I would like to spend a few minutes going over this PR and #1748 so that we may finalize them.

Collaborator

@ekluzek ekluzek left a comment


There are a couple of things that I see that could be done. But it's nice to get a working system test in place, so they could be done later. Let's chat about this and decide what should be done.

# Paths and command strings
executable_path = os.path.join(self._tool_bld, 'mksurfdata')
machine = self._case.get_value("MACH")
if machine == 'cheyenne':
Collaborator


You should be able to refactor this a bit to rely on the gen_mksurfdata_jobscript_single.py script to produce the commands that need to be run, rather than hardcoding the machines that can be used here. That would be a nice improvement, but it could also be a future change. The advantage is that it removes machine-specific details from this script and puts all the machine-specific setup in one place.

Contributor Author


I like that idea.

Contributor Author

@slevis-lmwg slevis-lmwg Jun 8, 2022


...so run gen_mksurfdata_jobscript_single.py and then run the jobscript without qsub in front.
...and the most general approach is to pick up the mpi command from env_mach_specific.xml (see Erik's comment about this), but open an issue for this.

The long version...
System test
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default
relies on the ability to build the mksurfdata executable anywhere. I had
named this directory (which can go anywhere) /tool_bld to distinguish it
from /bld, which appears during the sys test in the same case directory
in prep for submitting the ctsm run. Due to changes coming in the next
commit that make the sys test generate and submit the jobscript that
runs the mksurfdata executable (instead of spelling out the command
inside the sys test script), I had to make the name of the default bld
directory the same as the one expected by the sys test.
...instead of spelling out the mpi command inside the sys test script,
which also required different outcomes for different hostnames. Now
this is handled exclusively in gen_mksurfdata_jobscript_single.py.
@slevis-lmwg
Contributor Author

python testing OK
Two lilac sys tests fail due to this:

> git describe
alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-33-gde3fb92b6

pymod testing PASS

I plan on merging this PR to the ctsm5.2.mksurfdata branch tomorrow.

@slevis-lmwg slevis-lmwg marked this pull request as ready for review July 1, 2022 18:46
@slevis-lmwg slevis-lmwg merged commit 4dca577 into ESCOMP:ctsm5.2.mksurfdata Jul 1, 2022
@slevis-lmwg slevis-lmwg deleted the mksurfdata_esmf_sys_test branch July 1, 2022 18:49
slevis-lmwg added a commit to slevis-lmwg/ctsm that referenced this pull request Jul 2, 2022
…f_casper

Add mksurfdata_esmf system test to test-suite ESCOMP#1756

Confirms that mksurfdata_esmf builds & generates an fsurdat.
Confirms that the CTSM completes when pointing to the generated fsurdat.
Remove mksurfdata_map test from tools testing (if not done already).

Contributors other than @slevisconsulting: @ekluzek @billsacks
CTSM Issues Fixed: ESCOMP#1717
Answers expected to change? No, this is a new test.
User Interface Changes? New test will be added to test-suite.

Testing performed:
pymod testing PASS.
python testing OK. Two lilac sys tests fail due to:
> git describe
alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-33-gde3fb92b6

Resolved Conflicts:
tools/mksurfdata_esmf/gen_mksurfdata_jobscript_single.py
tools/mksurfdata_esmf/gen_mksurfdata_namelist.py