-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allocate c and cd var in evp, reduce number of "if grid_ice". #778
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks okay to me, at first glance. There must not really be a conflict between the _loc variables and the nonlocal ones, since it's BFB -- is that change necessary or helpful for 1Devp? I'll approve but would like others to take a look too.
How close is this to just having separate subroutines (or even modules) for B, C and CD grids?
Do we know how much this improves performance? |
I did not check performance and it may not be significant, however: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all seems reasonable.
One comment. Rather than create a new method (alloc_dyn_evp), why don't we put the alloc calls in init_dyn. That way, we don't have to modify all the drivers. To be honest, for consistency, as part of that, I would migrate alloc_dyn_eap into init_eap too and get rid of alloc_dyn_eap completely. That moves the alloc into the "init" phase of all the dynamics options, where there are already alloc statements, makes things consistent, and ultimately reduces the number of calls and code changes that might have to be made in the drivers in the long term.
This also raises the question of whether init_dyn should be renamed init_evp and moved into dyn_evp and whether we need to review what data is in the dyn_shared module vs dyn_evp/eap/vp, what's common and what's not. We could add an init_evp subroutine and the init_evp/eap/vp could call init_dyn for common data allocation rather than having that call be in the drivers. For now, I also think we can add "if kdyn == 1" in init_dyn if needed as a first step. Overall, I'd prefer if we migrate towards the drivers calling "init_evp/eap/vp" only and then having those methods take care of allocs as well as calling init_dyn in the shared method. Can we at least not add alloc_dyn_evp now? Thoughts?
That makes sense. I will update and move the allocation to dyn_init. |
A reflection on this. |
Right, any data allocated in ice_dyn_shared should live in ice_dyn_shared. Data in evp/eap/vp should live in those modules and be allocated in init_evp/eap/vp. At the top level, having something like this
And then each init_evp/eap/vp would/could call init_dyn_shared which lives in ice_dyn_shared. That makes sense to me and seems cleaner than what we have now. thanks! |
@TillRasmussen, I think we're waiting for a round of updates. If you need me to help, just let me know. No rush, just checking in. Thanks. |
EAP and EVP are bit for bit. |
@TillRasmussen, just trying to understand what was done in your testing. I assume you ran a test suite and added -s dynpicard to turn on vp for all tests? And you created a baseline first "picardrefac" that didn't have your mods in it? What are the bfbcomp results for the picardrefac version before your changes? If they are also failing, then we need to look into the dynpicard a bit more. If those were passing, then there must be a problem with the process. |
That is correct:
My reference run is, which is more
freyja-6 CICE/testsuite.picardrefac> git branch
* main
freyja-6 CICE/testsuite.picardrefac> git remote -v
origin ***@***.***:TillRasmussen/CICE.git (fetch)
origin ***@***.***:TillRasmussen/CICE.git (push)
I realized yesterday that phillipe made a push 21 days ago. Therefore I
reran the baseline with the update code.
However this did not make a difference.
The two baseline runs are in
results_updated.log (after code update)
results_baseline.log (before code update
They were executed with:
./cice.setup --suite quick_suite,decomp_suite -m freya -s dynpicard
--testid picardrefac --bgen picardrefac -e intel,gnu
The result is 82 bfb errors in the baseline and the same errors plus some
baseline exist in the updated run.
For the test and comparison (results in results_test.log) the following
experiment was carried out.
./cice.setup --suite quick_suite,decomp_suite -m freya -s dynpicard
--testid testpicardrefac --bcmp picardrefac -e intel,gnu
The same 82 test seems to fail.
On Fri, Nov 11, 2022 at 12:09 AM Tony Craig ***@***.***> wrote:
@TillRasmussen <https://github.com/TillRasmussen>, just trying to
understand what was done in your testing. I assume you ran a test suite and
added -s dynpicard to turn on vp for all tests? And you created a baseline
first "picardrefac" that didn't have your mods in it? What are the bfbcomp
results for the picardrefac version before your changes? If they are also
failing, then we need to look into the dynpicard a bit more. If those were
passing, then there must be a problem with the process.
—
Reply to this email directly, view it on GitHub
<#778 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AH7N7DSPUX6WWQZN7ZJF5V3WHV6ELANCNFSM6AAAAAARL4A4YU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
[results_baseline.log](https://github.com/CICE-Consortium/CICE/files/9990071/results_baseline.log)
[results_test.log](https://github.com/CICE-Consortium/CICE/files/9990073/results_test.log)
[results_updated.log](https://github.com/CICE-Consortium/CICE/files/9990074/results_updated.log)
|
This looks fine to me. Several things were moved around, but I think it works well. I don't love the "_loc" variables in the stressC* routines. I don't think there is a problem using the module variable name for a local variable. The compiler will keep it straight. The other thing is that maybe all that module data should be moved into subroutine evp and allocated like uocnU (which would not select for C or CD grid). I know this is also not ideal. Or it could be allocated on first_call. I know you don't like this, but it's more consistent with the allocation of uocnU. To make things even more consistent, I would treat uocnU and the other variables in that group the same way as the C/CD variables, make them allocatable and alllocate on first call or make them module data and allocate in init_evp. Or we could remove those variables from the calling arguments and just access the module data. Many options to clean this up further. This could also be a next step after this PR. I will run a test suite on cheyenne. But as I understand it, there still may be some problems with bit-for-bit on different pe counts and decomps for vp. As a general rule, we don't run test suites with the -s option, so we don't know which of those work or don't. And it's possible some of those are not setup properly or some of those are actually failing. We do have a test in omp_suite for dynpicard that should be picking up these bfb failures for different decomps, but maybe it's not comprehensive enough. Let me run some additional checks just to confirm. |
I encourage @eclare108213 @phil-blain @JFLemieux73 to have a look at these mods, it would be a step towards revising the dynamics driver, although largely just deals with the initialization. |
I ran a full test suite and everything passes except the sumchk and gridavgchk unit tests. Those unit tests have a CICE_InitMod.F90 that needs to be updated. Please update that code. I did not test the decomp_suite with dynpicard but will try that next. I assume I'll get some failures and we can create an issue for them if needed. Stepping back a bit, it would be really nice to reduce the amount of redundant code in the various drivers, pushing more of the init and run sequence out of the drivers. But that may not be an easy task. Maybe we should create an issue for that too though. |
@apcraig and others: Failing test: Is it prefered to call the alloc_dyn_shared from CICE_Init or from init_dyn_shared ? With regards to allocation then I prefer to do it in the init_evp as this is a "first call" functon. There is no need to add additional if statements. |
I think we want to leave CICE_init as it is now. It looks clean now, we don't want to add more calls to that. The second issue is if we want to initialize the arrays in a different routine than evp, then the data has to be module data. If we initialize the data in evp via first_call, then that data can live in the subroutine. I don't have a strong preference. I am not worried at all about a single "first_call" if test in evp. It's executed once per timestep, not once per subcycling or once per block or once per gridcell. The cost would be zero to do that if test. But, I'm also fine leaving the module data and init subroutine as it is now. Lets remove the _loc strings and verify it doesn't create any problems. A final option which I was reluctant to suggest before would be to just get rid of arguments that exist as module data in evp. We really don't need to pass those down, they exist in the module already. But I think that'll create some asymetry in the B and C grid implementations that I'm not that keen on. So, I would say
|
@apcraig |
On several machines, the sumchk fails in those tests because MPI_16 isn't implemented correctly. Have a look at the end of cice.runlog file for a summary of the tests. My guess is that it's failing on the "lsum16" tests. If this is happening, it's not unusual. It passes on many machines, fails on others, must depend on the MPI implementation. What you'll see above the final results are errors like this,
indicating it's the lsum16 option that's failing in the sumchk. So you should get the same results with your updated code and that's fine. I think it's worth testing the lsum16 option even if it fails on some machines. Feedback should be provided to the MPI implementation team. |
*loc removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me. I definitely like making the structure of the evp/eap/vp modules/routines more similar, reusing code where possible, moving 'if (first_call)' tasks to init routines, and pulling if statements out of loops where possible. Thanks also for cleaning up column alignments, capitalization, etc.
The 'Conversation' in this PR is long and a little confusing -- some of what's listed in the initial PR template is no longer true, and other things changed that are mentioned later. So my main suggestion at this point is to make a concise list of what's actually in this PR, and edit the first (template) entry to include that list, e.g. under "additional information".
Beautiful, thank you. This can be merged with approval from @apcraig or @phil-blain or @JFLemieux73. |
Retested on cheyenne, new problem. All boxslotcyl tests all fail. Looks like it's seg faulting here,
My guess is that something isn't being allocated that was before? A simple test to debug should be
boxslotcyl seems to be the only test with "kdyn=-1". I think that means the dyn initialization is not being called and moving some of those arrays to allocatable is causing the seg fault. I guess the question is how to fix this. My preference would be that if kdyn=-1, work1 at line 222 is set to 1 everywhere and then written. Or maybe not write iceumask if kdyn=-1 and then add a query_field for iceumask in the read restart subroutine. |
The runtime of the CI suite has gone over 2 hours in this version and is still running. A previous run (https://github.com/CICE-Consortium/CICE/actions/runs/3457289294) finished in 4 hours (!). Might be worth investigating? |
OK, the previous run was indeed 4 hours but it was the "download input data" step that took 4 hours: https://github.com/CICE-Consortium/CICE/actions/runs/3457289294/jobs/5770617947. But for the current one (https://github.com/CICE-Consortium/CICE/actions/runs/3462290692/jobs/5780984937) it's really the "run suite" step that is at 2 hours and counting... |
I restarted GHActions. I think it must be a cloud problem, but will try to keep an eye on it. |
Alternatively always allocate iceUmask, iceEmask, iceNmask through init_dyn_shared. I think that these are the only arrays that are used outside kdyn, which originates from ice dynamics |
I think the problem is that init_dyn_shared will not be called if kdyn=-1. iceUmask is already allocated anytime init_dyn_shared is called and that's the first array that is trapped as unallocated later in the run. Either treat iceUmask like iceEmask and iceNmask in the restart code. Or check for kdyn=-1 in the restart code. Or revert how iceUmask is allocated (but I prefer to continue forward if possible). |
Do we want to account for these changes when using restartfile_v4? |
I think we can leave restart_v4 as is. It's not part of our testing coverage and should probably be deprecated at some point. In addition, restart_v4 should be fine unless someone tries to restart with a v4 file AND kdyn=-1, not likely to happen. |
One other thing, for information. dynpicard is not bit-for-bit with different pe counts and decomps unless you also turn on bit-for-bit global sums. So you might want to run a test suite forcing dynpicard with
I believe this will pass bfbcomp tests above that failed with just -s dynpicard. |
writes restart files including iceumask, icenmask and iceemask when these fields are not allocated. Primarily for binary read of files. Skip reading when they dont exist. |
Is this fact noted in the documentation, e.g. in the testing section and/or Troubleshooting? |
I added a note about it in the "developer guide", in the section about global sums: https://cice-consortium-cice.readthedocs.io/en/main/developer_guide/dg_other.html#reproducible-sums |
Thanks, they are now bit for bit |
Ran a test suite on cheyenne, pio is hanging. I think it's the query_field subroutine. Let me try to debug a bit. More soon. |
OK, I think we need one more modification. This is my fault, query_field was never tested in pio. That subroutine in cicecore/cicedynB/infrastructure/io/io_pio2/ice_restart.F90 needs to be changed original:
new:
Basically, the pio_inq_varid should not be inside a master_task if check and then there doesn't need to be a broadcast. @TillRasmussen, could you make this change please. Thanks! |
36b8463
to
fffe822
Compare
Thanks @TillRasmussen! I think this is all working well now. I want to run a few more tests, then I think we can merge. I think this is a nice cleanup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing looks good.
For detailed information about submitting Pull Requests (PRs) to the CICE-Consortium,
please refer to: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index#information-for-developers
PR checklist
@TillRasmussen
results.log
Rename of variables in stress subroutines (added _loc) in order to avoid potential conflict with modul variables with same name
[edited to remove spaces from checkboxes so that they appear correctly]