Add/fix build capability for Gaea-C5, Gaea-C6, and container #800
Looks OK to me.
Two comments:
- I can't test the container.
- Additional changes are needed to run GSI/EnKF ctests on Gaea-C6. @DavidBurrows-NCO, do you plan to add these changes to this PR, or will a new issue and PR be opened to activate GSI/EnKF ctests on Gaea-C6?
@@ -155,7 +155,11 @@ target_link_libraries(gsi_fortran_obj PUBLIC nemsio::nemsio)
target_link_libraries(gsi_fortran_obj PUBLIC ncio::ncio)
target_link_libraries(gsi_fortran_obj PUBLIC w3emc::w3emc_d)
target_link_libraries(gsi_fortran_obj PUBLIC sp::sp_d)
target_link_libraries(gsi_fortran_obj PUBLIC bufr::bufr_d)
if(DEFINED ENV{USE_BUFR4})
Looks OK to me. Cross-checking with @DavidHuber-NOAA. PR #791 upgrades to bufr/12.1.0. Not sure how the bufr logic added here might impact Dave's PR.
@RussTreadon-NOAA Thanks for taking a look
Hello @RussTreadon-NOAA. I've made good progress on GSI reg tests. I'm currently using the same walltime/processor configuration as C5 for C6. This can be adjusted, but here are the current results:
rrfs_3denvar_rdasens_loproc_updat keeps hitting the wall clock even after I increased it to 60 minutes. It freezes in the same spot. I've attached a text file of the output log. I don't see anything helpful in the working directory. Please let me know your thoughts. Thanks!
Thank you @DavidBurrows-NCO for the update. This looks good. We've had problems with the |
C6 ctest results
@DavidBurrows-NCO, I obtained similar C6 ctest results
C5 ctest results
Ctest behavior on C5 is very different. The following tests ran and failed
rtma
global_enkf
Line 134 of global_4denvar
Not sure what's going on here. Do we need to (un)set certain environment variables on C5? The remaining tests
were all killed by the system after reaching the specified wall clock limit. Interestingly, the low task count
The C5
@RussTreadon-NOAA I vaguely recall something like this previously, like 6-9 months ago, where GSI would run but crash at the very end of execution. Do you recall this, or am I imagining it?
@CoryMartin-NOAA, this sounds vaguely familiar. Let me check GSI issues and PRs for clues. The rrfs failure is a known problem.
Hi @RussTreadon-NOAA. I know it's not the solution you want, but I adjusted the node/processor configuration to match Hera on C6 and rrfs was successful:
I'm working on C5 right now.
@DavidBurrows-NCO: Changing the task count is consistent with the regional DA team recommendation. Thank you for looking at the C5 failures.
@TingLei-NOAA informed me that he will be on leave and is unable to update the task count for @DavidBurrows-NCO , please commit the modified |
@DavidBurrows-NCO , we need two peer reviews for GSI PRs. My review doesn't count as a peer review. Who would you like to review this PR? |
Approve.
ush/sub_gaeac5
@@ -158,6 +158,7 @@ sbatch=${sbatch:-sbatch}
ofile=$DATA/subout$$
>$ofile
chmod 777 $ofile
export FI_VERBS_PREFER_XRC=0
Does this setting resolve what appears to be mpi_finalize problems on C5?
> Does this setting resolve what appears to be mpi_finalize problems on C5?
It appears so. Here is the notice from Seth Underwood with Gaea C5: "After the C5 update, users reported that some jobs failed during the MPI_Finalize call. We have alerted ORNL and HPE. HPE has suggested setting the environment variable FI_VERBS_PREFER_XRC=0 in the run script (setenv FI_VERBS_PREFER_XRC 0, for csh; export FI_VERBS_PREFER_XRC=0). This has resolved the error in our tests. Please add this variable to your run script(s) if you also hit this error. Please note that we do not see any issues preemptively setting this environment variable."
Now that I think the MPI_Finalize issue is resolved, I am going to adjust the resources and test a little more. I'll let you know when I have my final changes in place for you to look over.
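The workaround quoted in the notice amounts to one line in the run script. A minimal sketch, assuming a POSIX shell job script: the helper name `apply_xrc_workaround` is hypothetical and is not part of `ush/sub_gaeac5`, which sets the variable with a plain `export`. The `${VAR:-default}` form mirrors the `sbatch=${sbatch:-sbatch}` idiom already in that script and keeps any value the caller has chosen.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): apply the HPE-suggested setting
# unless the caller has already chosen a non-empty value.
apply_xrc_workaround() {
    FI_VERBS_PREFER_XRC="${FI_VERBS_PREFER_XRC:-0}"
    export FI_VERBS_PREFER_XRC
}

apply_xrc_workaround
echo "FI_VERBS_PREFER_XRC=${FI_VERBS_PREFER_XRC}"
```

For csh run scripts, the notice gives the equivalent as `setenv FI_VERBS_PREFER_XRC 0`.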
Excellent! Thank you @DavidBurrows-NCO for working through various issues.
@RussTreadon-NOAA Quick question: if a particular test fails, but I check all the stdout files and they return rc=0, does that typically mean the job took too long to run? I assume there are set run time values for each test? Thanks
Unfortunately, the checks in GSI ctests are not very robust. Some of the timing and memory usage checks can yield false positives: the test is reported as Failed, but a check of the results does not indicate a problem. Since GSI has no code manager and we are transitioning to JEDI, it's unlikely the GSI ctests will be cleaned up to yield more consistent results. If there's a particular failure you'd like me to look at, give me the path or the rundir and I'll take a look.
3b98cf4
Gaea C5 and C6 ctests
Gaea C5
The
Here are the
Indeed the loproc_updat wall time is considerably greater than the loproc_contrl wall time. Note, however, that the updat and contrl tests use the same
Gaea C6
The
Here are the wall times from the various tests
For both tests the hiproc_updat wall time is notably greater than the hiproc_contrl wall time. This is interesting. The updat and contrl runs use the same executables. The wall time differences reflect differences in system load, I/O speed, or other aspects of the system. This is not a fatal failure.
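Because the updat and contrl runs share executables, a wall time comparison is only meaningful with some tolerance for system noise. A rough sketch of such a comparison, assuming wall times given in seconds: the function name, its arguments, and the numbers in the usage lines are hypothetical and are not taken from the GSI regression scripts.

```shell
#!/bin/sh
# Hypothetical check (not the GSI regression logic): return status 0 when
# the updat wall time exceeds the contrl wall time by more than a factor.
# Usage: walltime_exceeds UPDAT_SECONDS CONTRL_SECONDS FACTOR
walltime_exceeds() {
    awk -v u="$1" -v c="$2" -v f="$3" 'BEGIN { exit !(u > c * f) }'
}

# Example values are made up for illustration.
if walltime_exceeds 480 300 1.5; then
    echo "updat wall time exceeds tolerance"
else
    echo "updat wall time within tolerance"
fi
```

A looser factor trades sensitivity for fewer false positives on a busy system; the right value depends on how much run-to-run variation the machine shows.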
Looks good. Please reduce the Gaea C6 wall clock limit for rrfs_3denvar_rdasens from 0:60:00 to 0:15:00.
This PR is awaiting the return of WCOSS2 to developers so WCOSS2 ctests can be run. Assuming reduction of the Gaea C6 |
Looks good to me. Approve.
8cf6434
Approve.
@DavidBurrows-NCO, NCO said the Cactus upgrade encountered some issues which they are working through. I'm not sure when the development WCOSS2 machine will come back online. While we wait, I installed this PR on both Dogwood and Cactus. I'm ready to run when either machine becomes dev.
@RussTreadon-NOAA Thanks for the info, and thanks for your quick back and forth with this PR. Have a good weekend!
WCOSS2 ctests
Install
Resolves #799
Type of change
How Has This Been Tested?
Cloned and built on Gaea-C5, Gaea-C6, and in a container.
Checklist