-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mksurfdata_esmf system test to test-suite #1756
Add mksurfdata_esmf system test to test-suite #1756
Conversation
…ata_esmf_sys_test Making sure I'm caught up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great to see. I have a few simple comments for small changes.
The other thing I wonder about is adding a "generic" machine option that would add some comments about how you need to change the script to work on a generic machine. So it would add something like this in front of the # PBS commands...
# Edit the batch directives for your batch system
# Below are the examples for the batch directives on cheyenne:
# Edit the following to work on your batch system
And then for the mpiexec line have it setup to use mpirun, and add a comment like this
# Edit the mpirun command to use the MPI executable on your system and the arguments it's requires.
An expansion that we will want for this is to add the ability to also run on izumi. It would be good to know we can easily run on NCAR machines cheyenne and izumi. We could make running on izumi require a custom script as another solution. But, it also doesn't look that hard to extend this to work on izumi as well.
@ekluzek |
Currently casper complains that `git -C` is not a valid option. I added -C to the `git describe` in gen_mksurfdata_namelist.py for this system test to work.
@billsacks and @ekluzek |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done a super-careful review of this, but I feel like I have reviewed it enough. It looks great! Thanks a lot for your work on this, Sam. I do have two inline comments, which I don't feel absolutely need to be addressed, but would be good to do – especially the more substantive one about the location of the build directory, though I realize that could involve some significant changes to the current build script.
One other thing that I try to address whenever I'm adding a new test: Did you test some failures at different points (e.g., build failure of mksurfdata_esmf, runtime failure of mksurfdata_esmf, runtime failure of the model) to ensure errors are reported correctly in those failure cases?
if not os.path.exists(os.path.join(self._get_caseroot(), | ||
'done_MKSURFDATAESMF_setup.txt')): | ||
# Paths and strings | ||
self._rm_bld_dir = f"rm -rf {self._tool_path}/bld" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not ideal that this needs to mess with directories in the CTSM sandbox. Some issues I see with that are:
- Trying to run multiple tests of this test type could cause failures due to race conditions: the tests could stomp on each other's build directory
- You could have problems if you're running the testing from a place where you also want to actually use the mksurfdata build for some real work at the same time
- You can't run tests from a directory where you don't have write permission (may not be an issue in practice)
I think the solution to this is to run the cmake and make commands from some other location that is inside the test directory rather than being under the ctsm root. If I remember correctly, the way cmake works is that you run cmake from your desired build location, pointing to the source directory. So you could create a build directory inside the test directory then run cmake /path/to/ctsmroot/tools/mksurfdata_esmf/src
. For this to work, the paths in gen_mksurfdata_build would need to be generalized so that it could be run from anywhere. Or maybe there's some other way to accomplish this goal.
I don't feel this absolutely needs to be done immediately, but it should be done at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a temporary solution, the test now deletes the /tools/mksurfdata_esmf/bld directory when it's done needing it. This reduces the likelihood of the above risks without eliminating it.
To resolve this, I would like to discuss it in a meeting (me and @ekluzek ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view, the temporary solution could make things worse in a number of situations: where some other process is expecting the build to be there, and then it gets removed out from underneath. So I'd say what you had before this was probably a bit better... but really it's hard to avoid problems in certain situations until you move to a build in a test-specific location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I reverted to what I had before this.
Good point. While developing the test, I had to look at TestStatus.log for useful error messages. So I'm adding guidance along those lines when any of the subprocess commands fails. However, I'm leaving error messages unchanged for runtime failure of the model, with the assumption that such messages are standard across these tests. |
Sounds good. Needing to look in TestStatus.log is reasonable. The most important thing is that a failure at any stage results in a FAIL result somewhere in the TestStatus file. This should be true both for the initial run of the test and a rerun. Ideally, a failure in the build or run of mksurfdata_esmf would cause the test to abort with a FAIL before it gets into the model run. Similarly, if you run the test once and it passes, and then you make a change to mksurfdata_esmf and rerun the existing test, but now there is a build or runtime failure in mksurfdata_esmf, the test should also report a FAIL result in the appropriate phase. We want to avoid, for example, the possibility that in a rerun, mksurfdata_esmf fails but there was already an output file in place so the test happily continues with the previously-generated output file. |
Thanks @slevisconsulting ! |
I skimmed back through my earlier comments. Everything is addressed except for the comments on trying to do the build within the test directory if possible. |
@ekluzek per our conversation: |
Now the new mksurfdata_esmf system test doesn't create /bld in users' .../tools/mksurfdata_esmf and instead places this build directory where the test case builds and runs. This avoids potential confusion from a test manipulating files in users' ctsm directories.
...despite "git -C" option unavailable on casper
@ekluzek I will commit/push pylint corrections soon. At that point this PR will be ready for final review. |
Testing performed: @ekluzek pls let me know if you would like to review this PR in a meeting. |
I was looking this over and I think the best way to keep this updated and get this to work on izumi, will be to use the env_mach_specific.xml file to get what the mpirun command needs to look like. It looks like the mpirun command (from looking at it for a CTSM case is fairly complex and it depends on the MPI library being used. So for mvapich2
and for openmpi
The mpirun can be complex and changing with CESM, so using CESM to manage it might be the best way to go. That also provides a general solution that will work for any machine, compiler, and mpilib combination that cime is ported to. |
@slevisconsulting and I looked this over a bit and tried to get it working and run into some roadblocks. One thing that I think will be important to get working properly is the ability to turn DEBUG compiler options on in the build. We think that just requires the env variable DEBUG to be set before the configure line, but I'm suspicious that because this is a CMAKE build that there's more to it than that. |
Oh, I think I put my comments in the wrong PR... |
@ekluzek I wasn't pushing for wrapping up this PR and #1748 because of large diffs that I was finding in these files generated by mksurfdata_esmf: So I would like to spend a few minutes going over this PR and #1748 so that we may finalize them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a couple things that I see that could be done. But, it's nice to get a working system test in place, so they could be done later. So let's chat about this and decide what should be done.
# Paths and command strings | ||
executable_path = os.path.join(self._tool_bld, 'mksurfdata') | ||
machine = self._case.get_value("MACH") | ||
if machine == 'cheyenne': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to refactor this a bit to rely on the gen_mksurfdata_jobscript_single.py script to produce the commands that need to be run rather than hardcoding the machines that can be used here. That would be a nice improvement, but this could also be a future change. The advantage is that it removes embedding of machine specific things into this script, and puts all the machine specific run into one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...so run gen_mksurfdata_jobscript_single.py
and then run the jobscript without qsub in front.
...and most general approach is to pick up the mpi command from env_mach_specific.xml
(see Erik's comment about this) but open an issue for this.
The long version... System test MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default relies on the ability to build the mksurfdata executable anywhere. I had named this directory (that can go anywhere) /tool_bld to distinguish from /bld that appears during the sys test in the same case directory in prep for submitting the ctsm run. Due to changes coming in the next commit that make the sys test generate and submit the jobscript that runs the mksurfdata executable (instead of spelling out the command inside the sys test script), I had to make the name of the default bld directory the same as the one expected by the sys test.
...instead of spelling out the mpi command inside the sys test script which also required different outcomes for different hostnames. Now this is handled exclusively in gen_mksurfdata_jobscript_single.py
python testing OK
pymod testing PASS I plan on merging this PR to the ctsm5.2.mksurfdata branch tomorrow. |
…f_casper Add mksurfdata_esmf system test to test-suite ESCOMP#1756 Confirms that mksurfdata_esmf builds & generates an fsurdat. Confirms that the CTSM completes when pointing to the generated fsurdat. Remove mksurfdata_map test from tools testing (if not done already). Contributors other than @slevisconsulting: @ekluzek @billsacks CTSM Issues Fixed: ESCOMP#1717 Answers expected to change? No, this is a new test. User Interface Changes? New test will be added to test-suite. Testing performed: pymod testing PASS. python testing OK. Two lilac sys tests fail due to: > git describe alpha-ctsm5.2.mksrf.03_ctsm5.1.dev090-33-gde3fb92b6 Resolved Conflicts: tools/mksurfdata_esmf/gen_mksurfdata_jobscript_single.py tools/mksurfdata_esmf/gen_mksurfdata_namelist.py
Description of changes
Confirm that mksurfdata_esmf builds and generates an fsurdat (CTSM surface dataset).
Confirm that the CTSM completes when pointing to the generated fsurdat.
Remove mksurfdata_map test from tools testing (if not done already).
Specific notes
Contributors other than yourself, if any:
@billsacks @ekluzek
CTSM Issues Fixed (include github issue #):
#1717
Are answers expected to change (and if so in what way)?
No, this is a new test.
Any User Interface Changes (namelist or namelist defaults changes)?
New test will be added to test-suite.
Testing performed, if any:
Successfully ran a very early draft of the script using the test
MKSURFDATAESMF_P144x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-default
Generating the f10 surface dataset points to a lower resolution topography raw dataset, thereby bypassing the 1-km topography raw dataset. See corresponding failures below.
Failures:
MKSURFDATAESMF_P144x1.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default
with mpiexec_mpt -np 144. Stops while working with the 1-km topography raw dataset, so I assume it needs more memory per node.MKSURFDATAESMF_P48x3.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default
with mpiexec_mpt -np 48 (144 returned error). Stops while working with the 1-km topography raw dataset.