
Add mksurfdata_esmf system test to test-suite #1756

Merged

Conversation

slevis-lmwg
Contributor

@slevis-lmwg slevis-lmwg commented May 17, 2022

Description of changes

Confirm that mksurfdata_esmf builds and generates an fsurdat (CTSM surface dataset).
Confirm that the CTSM completes when pointing to the generated fsurdat.

Remove mksurfdata_map test from tools testing (if not done already).

Specific notes

Contributors other than yourself, if any:
@billsacks @ekluzek

CTSM Issues Fixed (include github issue #):
#1717

Are answers expected to change (and if so in what way)?
No, this is a new test.

Any User Interface Changes (namelist or namelist defaults changes)?
New test will be added to test-suite.

Testing performed, if any:
Successfully ran a very early draft of the script using the test
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default
Generating the f10 surface dataset uses a lower-resolution topography raw dataset, thereby bypassing the 1-km topography raw dataset. See the corresponding failures below.

Failures:
MKSURFDATAESMF_P144x1.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default with mpiexec_mpt -np 144. Stops while working with the 1-km topography raw dataset, so I assume it needs more memory per node.
MKSURFDATAESMF_P48x3.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default with mpiexec_mpt -np 48 (144 returned error). Stops while working with the 1-km topography raw dataset.

Collaborator

@ekluzek ekluzek left a comment


This is really great to see. I have a few simple comments for small changes.

The other thing I wonder about is adding a "generic" machine option that would add some comments about how you need to change the script to work on a generic machine. So it would add something like this in front of the # PBS commands...

# Edit the batch directives for your batch system
# Below are the examples for the batch directives on cheyenne:
# Edit the following to work on your batch system

And then for the mpiexec line have it setup to use mpirun, and add a comment like this

# Edit the mpirun command to use the MPI executable on your system and the arguments it requires.

An expansion that we will want for this is to add the ability to also run on izumi. It would be good to know we can easily run on NCAR machines cheyenne and izumi. We could make running on izumi require a custom script as another solution. But, it also doesn't look that hard to extend this to work on izumi as well.

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 18, 2022

And then for the mpiexec line have it setup to use mpirun, and add a comment like this
# Edit the mpirun command to use the MPI executable on your system and the arguments it requires.

@ekluzek
Are you suggesting that I set it up to use mpirun for all machines, including cheyenne and casper? Or set it up with mpirun only for other machines?

@slevis-lmwg
Contributor Author

@billsacks and @ekluzek
I think this work is at a good stopping point, so I'm requesting your reviews.

Member

@billsacks billsacks left a comment


I haven't done a super-careful review of this, but I feel like I have reviewed it enough. It looks great! Thanks a lot for your work on this, Sam. I do have two inline comments, which I don't feel absolutely need to be addressed, but would be good to do – especially the more substantive one about the location of the build directory, though I realize that could involve some significant changes to the current build script.

One other thing that I try to address whenever I'm adding a new test: Did you test some failures at different points (e.g., build failure of mksurfdata_esmf, runtime failure of mksurfdata_esmf, runtime failure of the model) to ensure errors are reported correctly in those failure cases?

cime_config/SystemTests/mksurfdataesmf.py
if not os.path.exists(os.path.join(self._get_caseroot(),
                                   'done_MKSURFDATAESMF_setup.txt')):
    # Paths and strings
    self._rm_bld_dir = f"rm -rf {self._tool_path}/bld"
Member


It's not ideal that this needs to mess with directories in the CTSM sandbox. Some issues I see with that are:

  • Trying to run multiple tests of this test type could cause failures due to race conditions: the tests could stomp on each other's build directory
  • You could have problems if you're running the testing from a place where you also want to actually use the mksurfdata build for some real work at the same time
  • You can't run tests from a directory where you don't have write permission (may not be an issue in practice)

I think the solution to this is to run the cmake and make commands from some other location that is inside the test directory rather than being under the ctsm root. If I remember correctly, the way cmake works is that you run cmake from your desired build location, pointing to the source directory. So you could create a build directory inside the test directory then run cmake /path/to/ctsmroot/tools/mksurfdata_esmf/src. For this to work, the paths in gen_mksurfdata_build would need to be generalized so that it could be run from anywhere. Or maybe there's some other way to accomplish this goal.

I don't feel this absolutely needs to be done immediately, but it should be done at some point.
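The out-of-source build described above can be sketched as follows. This is only an illustration of the cmake pattern, not the actual gen_mksurfdata_build logic; the function name and paths are hypothetical.

```python
import os

def out_of_source_build_cmds(ctsm_root, build_dir, gmake_j=4):
    """Sketch: build mksurfdata from a test-specific directory instead
    of the CTSM sandbox. cmake is run from the build directory,
    pointing back at the source tree. Illustrative only."""
    src_dir = os.path.join(ctsm_root, "tools", "mksurfdata_esmf", "src")
    return [
        f"mkdir -p {build_dir}",
        # run cmake from the desired build location, pointing to source
        f"cd {build_dir} && cmake {src_dir}",
        f"cd {build_dir} && make -j {gmake_j}",
    ]
```

Because nothing here writes into the CTSM checkout, multiple tests can each use their own build directory without stomping on one another.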

Contributor Author


As a temporary solution, the test now deletes the /tools/mksurfdata_esmf/bld directory when it's done needing it. This reduces the likelihood of the above risks without eliminating them.

To resolve this, I would like to discuss it in a meeting (me and @ekluzek ?)

Member


In my view, the temporary solution could make things worse in a number of situations where some other process expects the build to be there and then it gets removed out from under it. So I'd say what you had before this was probably a bit better... but really it's hard to avoid problems in certain situations until you move to a build in a test-specific location.

Contributor Author


Ok, I reverted to what I had before this.

@slevis-lmwg
Contributor Author

Did you test some failures at different points (e.g., build failure of mksurfdata_esmf, runtime failure of mksurfdata_esmf, runtime failure of the model) to ensure errors are reported correctly in those failure cases?

Good point. While developing the test, I had to look at TestStatus.log for useful error messages. So I'm adding guidance along those lines when any of the subprocess commands fails. However, I'm leaving error messages unchanged for runtime failure of the model, with the assumption that such messages are standard across these tests.

@billsacks
Member

Good point. While developing the test, I had to look at TestStatus.log for useful error messages. So I'm adding guidance along those lines when any of the subprocess commands fails. However, I'm leaving error messages unchanged for runtime failure of the model, with the assumption that such messages are standard across these tests.

Sounds good. Needing to look in TestStatus.log is reasonable. The most important thing is that a failure at any stage results in a FAIL result somewhere in the TestStatus file. This should be true both for the initial run of the test and a rerun. Ideally, a failure in the build or run of mksurfdata_esmf would cause the test to abort with a FAIL before it gets into the model run. Similarly, if you run the test once and it passes, and then you make a change to mksurfdata_esmf and rerun the existing test, but now there is a build or runtime failure in mksurfdata_esmf, the test should also report a FAIL result in the appropriate phase. We want to avoid, for example, the possibility that in a rerun, mksurfdata_esmf fails but there was already an output file in place so the test happily continues with the previously-generated output file.
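The behavior described above (a failure in any phase should abort the test with a FAIL, rather than letting a later phase continue with stale output) can be sketched like this. The function and phase names are hypothetical, not the actual CIME/CTSM test code.

```python
import subprocess

def run_phase(cmd, phase_name):
    """Sketch: run one stage of the system test (e.g. the
    mksurfdata_esmf build or run) and fail loudly on error, so a
    later phase never runs against previously-generated output.
    Message wording is illustrative only."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Abort here rather than continuing into the model run
        raise RuntimeError(
            f"{phase_name} FAIL (rc={result.returncode}); "
            "see TestStatus.log for the command output"
        )
    return result.stdout
```

Raising on the first failed phase also covers the rerun case: a fresh build or runtime failure in mksurfdata_esmf surfaces as a FAIL even when an old output file is still sitting in the test directory.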

@billsacks
Member

Thanks @slevisconsulting!

@billsacks
Member

I skimmed back through my earlier comments. Everything is addressed except for the comments on trying to do the build within the test directory if possible.

@slevis-lmwg
Contributor Author

The other thing I wonder about is adding a "generic" machine option that would add some comments about how you need to change the script to work on a generic machine. So it would add something like this in front of the # PBS commands...

@ekluzek per our conversation:
I included the guidance comments that you recommended and will not pursue implementation of a "generic" machine option for now.

Now the new mksurfdata_esmf system test doesn't create /bld in users'
.../tools/mksurfdata_esmf and instead places this build directory where
the test case builds and runs. This avoids potential confusion from a
test manipulating files in users' ctsm directories.
...despite the "git -C" option being unavailable on casper
@slevis-lmwg
Contributor Author

@ekluzek I will commit/push pylint corrections soon. At that point this PR will be ready for final review.

@slevis-lmwg
Contributor Author

Testing performed:
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default PASS
make utest PASS
make stest: two lilac tests fail with "Current machine cheyenne does not match case machine ctsm-build"
...but this is expected for git describe --> alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-26-g8b720987b, as explained in #1739, which is resolved in later versions.

@ekluzek pls let me know if you would like to review this PR in a meeting.

@ekluzek
Collaborator

ekluzek commented May 26, 2022

I was looking this over, and I think the best way to keep this updated and get it to work on izumi will be to use the env_mach_specific.xml file to determine what the mpirun command needs to look like. The mpirun command (from looking at it for a CTSM case) is fairly complex, and it depends on the MPI library being used.

So for mvapich2

mpiexec --machinefile $ENV{PBS_NODEFILE} -n <ntasks> --prepend-rank

and for openmpi

mpiexec -n <ntasks> --tag-output

The mpirun command can be complex and changes with CESM, so using CESM to manage it might be the best way to go. That also provides a general solution that will work for any machine, compiler, and mpilib combination that cime is ported to.
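For illustration only, the per-library launch lines above could be selected like this. The function is hypothetical and the templates are hardcoded purely to show the shape of the problem; the point of the comment is that CIME's env_mach_specific.xml should supply these instead.

```python
def mpi_launch_cmd(mpilib, ntasks, executable):
    """Sketch: pick the MPI launch line based on the MPI library,
    following the two examples above. $PBS_NODEFILE is the shell form
    of the node file variable. Hardcoding this is exactly what the
    env_mach_specific.xml approach would avoid."""
    if mpilib == "mvapich2":
        return (f"mpiexec --machinefile $PBS_NODEFILE -n {ntasks} "
                f"--prepend-rank {executable}")
    if mpilib == "openmpi":
        return f"mpiexec -n {ntasks} --tag-output {executable}"
    raise ValueError(f"no launch template for mpilib {mpilib}")
```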

@ekluzek
Collaborator

ekluzek commented May 26, 2022

@slevisconsulting and I looked this over a bit, tried to get it working, and ran into some roadblocks. One thing that I think will be important to get working properly is the ability to turn DEBUG compiler options on in the build. We think that just requires the env variable DEBUG to be set before the configure line, but I suspect that because this is a CMake build there's more to it than that.

@ekluzek
Collaborator

ekluzek commented May 26, 2022

Oh, I think I put my comments in the wrong PR...

@slevis-lmwg
Contributor Author

@ekluzek pls let me know if you would like to review this PR in a meeting.

@ekluzek I wasn't pushing for wrapping up this PR and #1748 because of large diffs that I was finding in these files generated by mksurfdata_esmf:
surfdata_0.9x1.25_SSP5-8.5_78pfts_CMIP6_1850-2100_c220525.nc
landuse.timeseries_0.9x1.25_SSP5-8.5_78_CMIP6_1850-2100_c220525.nc
The good news is that I have now tracked down the diffs, and they are fully expected. In particular, the large diffs were against release-clm5.0.18, while the diffs are very small when I compare against the files in /glade/work/oleson/ctsm_dynurb_mksurf_ctsm5.2.mksurfdata_I1564/tools/mksurfdata_map, because Keith generated those files with the same code mods that I used from #1586 and #1587.

So I would like to spend a few minutes going over this PR and #1748 so that we may finalize them.

Collaborator

@ekluzek ekluzek left a comment


There are a couple of things that I see that could be done. But it's nice to get a working system test in place, so they could be done later. Let's chat about this and decide what should be done.

# Paths and command strings
executable_path = os.path.join(self._tool_bld, 'mksurfdata')
machine = self._case.get_value("MACH")
if machine == 'cheyenne':
Collaborator


You should be able to refactor this a bit to rely on the gen_mksurfdata_jobscript_single.py script to produce the commands that need to be run, rather than hardcoding the machines that can be used here. That would be a nice improvement, but it could also be a future change. The advantage is that it removes machine-specific details from this script and puts all the machine-specific setup in one place.

Contributor Author


I like that idea.

Contributor Author

@slevis-lmwg slevis-lmwg Jun 8, 2022


...so run gen_mksurfdata_jobscript_single.py and then run the jobscript without qsub in front.
...and the most general approach is to pick up the mpi command from env_mach_specific.xml (see Erik's comment about this), but open an issue for this.

The long version...
System test
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default
relies on the ability to build the mksurfdata executable anywhere. I had
named this directory (which can go anywhere) /tool_bld to distinguish it
from /bld, which appears during the sys test in the same case directory
in prep for submitting the ctsm run. Due to changes coming in the next
commit that make the sys test generate and submit the jobscript that
runs the mksurfdata executable (instead of spelling out the command
inside the sys test script), I had to make the name of the default bld
directory the same as the one expected by the sys test.
...instead of spelling out the mpi command inside the sys test script,
which also required different outcomes for different hostnames. Now
this is handled exclusively in gen_mksurfdata_jobscript_single.py.
@slevis-lmwg
Contributor Author

python testing OK
Two lilac sys tests fail due to this:

> git describe
alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-33-gde3fb92b6

pymod testing PASS

I plan on merging this PR to the ctsm5.2.mksurfdata branch tomorrow.

@slevis-lmwg slevis-lmwg marked this pull request as ready for review July 1, 2022 18:46
@slevis-lmwg slevis-lmwg merged commit 4dca577 into ESCOMP:ctsm5.2.mksurfdata Jul 1, 2022
@slevis-lmwg slevis-lmwg deleted the mksurfdata_esmf_sys_test branch July 1, 2022 18:49
slevis-lmwg added a commit to slevis-lmwg/ctsm that referenced this pull request Jul 2, 2022
…f_casper

Add mksurfdata_esmf system test to test-suite ESCOMP#1756

Confirms that mksurfdata_esmf builds & generates an fsurdat.
Confirms that the CTSM completes when pointing to the generated fsurdat.
Remove mksurfdata_map test from tools testing (if not done already).

Contributors other than @slevisconsulting: @ekluzek @billsacks
CTSM Issues Fixed: ESCOMP#1717
Answers expected to change? No, this is a new test.
User Interface Changes? New test will be added to test-suite.

Testing performed:
pymod testing PASS.
python testing OK. Two lilac sys tests fail due to:
> git describe
alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-33-gde3fb92b6

Resolved Conflicts:
tools/mksurfdata_esmf/gen_mksurfdata_jobscript_single.py
tools/mksurfdata_esmf/gen_mksurfdata_namelist.py