C-grid coupled validation #721
I believe I have a set of changes for my cgrid branch which reproduce our pre-cgrid baselines. The changes are shown in this comparison. I'm not sure these are the minimum required changes, however, so I'll do some more testing.
@DeniseWorthen, WOW! That's great. Looks pretty reasonable to me. I think I understand these changes. It looks like it's mostly making the uocn/vocn/ss_tltx/ss_tlty consistent with what was there before. But then there are a few missing halo updates. We will have to decide how to proceed. Obviously, you can do whatever you want in the cap. The changes in ice_grid should be fine, although I wonder why testing in standalone didn't pick up that issue. The halo updates there and in ice_dyn_evp should affect all cases to some degree. The main thing we'll have to discuss is the change of "F" to "S" mapping in ice_dyn_evp.F90. I think standalone will want to stick with "S" and I think "S" is the correct mapping for those state fields. But the mapping was "F" for coupled systems. Do we need to add a namelist to allow the user to control the mapping type? Do we all agree "S" is the right thing to do and so coupled results will change? Again, many thanks for your efforts!!
Thanks so much for this. I am still working on getting the C-grid version working in CESM. I hope to have that before the end of the week.
I've tested and found I can remove the added HaloUpdates in ice_dyn_evp (for tmass and aice_init), which I anticipated. However, I cannot remove the added HaloUpdates in ice_grid for tarea, tarear, uarea, uarear and retain B4B. It does seem curious that those aren't getting picked up as required in the standalone. I agree the main issue is the F/S mapping and that "S" may be correct, so we need to accept answer changes. I'm trying to understand why they would have been mapped as fluxes to begin with.
Thanks for the update @DeniseWorthen. Sounds good. Part of the reason the "F" was used before is because that's what existed for averaging T to U (and vice versa). There really wasn't an "S" except for regular rectangular grids. Going through the C grid implementation highlighted a number of small issues in handling of the grid and forcing. The grid discretization is actually quite poor, and we plan to provide an option to improve this, #653. We also found issues in the locations of a few fields (T vs U) largely related to forcing/coupling, and we added several new options for handling grid averaging to improve that. As we migrated code that lived in the standalone forcing or in the coupling layer to a more central location with more general capability to handle A/B/C/CD coupling/forcing fields vs internal ice fields, it's probably not too surprising that we're running into a few of these issues. Ultimately, it's up to you and @dabail10 to decide what you need/want for your coupled systems. I do think "S" is the better interpolation for the ocean velocity and surface tilt terms, but happy to discuss further and ultimately support multiple options for these issues.
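The "S" vs "F" distinction being discussed can be sketched in one dimension. This is purely illustrative: the weighting formulas, function names, and areas below are hypothetical stand-ins, not CICE's actual grid_average_X2Y implementation.

```python
# Hypothetical 1D sketch of "S" (state) vs "F" (flux) T->U averaging.
# Not CICE code; the weighting formulas are illustrative assumptions.

def t2u_state(t1, t2, a1, a2):
    """'S' mapping: area-weighted mean of the two T cells flanking a U
    point, the natural choice for state fields like ocean velocity."""
    return (t1 * a1 + t2 * a2) / (a1 + a2)

def t2u_flux(t1, t2, a1, a2, au):
    """'F' mapping: conserve the area-integrated quantity by normalizing
    with the destination U-cell area instead."""
    return 0.5 * (t1 * a1 + t2 * a2) / au

# On a uniform grid (a1 == a2 == au) the two coincide, consistent with the
# regular-grid cases being bit-for-bit either way; on a stretched grid they
# generally differ, which is where coupled answers change.
```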
FYI, I have the current version of CICE (with c-grid) running in RASM. Answers are different, but I haven't looked beyond that yet. Will do some more testing, mainly to confirm the cap is OK. I think with @DeniseWorthen's efforts, we have a good handle on where answer changes are coming from. I did find the problem noted by @dabail10 in terms of "USE_NETCDF" in the io_pio2 query var routine. We will want to fix that at some point in one of the PRs. I may create a PR for "RASM fixes", still TBD.
Just an update here. I am having a hard time getting the C-grid code to run with CESM. Actually, I can't even get the version before the C-grid merge to work. It could be a CAP issue, but I wonder if it also might be the OMP updates. I will try to make some progress on it in the next couple days.
More info on RASM. If I run an f19_g16 G case, I get identical answers for a 10 day run before and after the cgrid merge if I use @DeniseWorthen's mods. Without those mods, answers are different. So that is consistent with what @DeniseWorthen got. Running our 9km regional Arctic grid, the results are somewhat different. I am not seeing bit-for-bit before and after the cgrid merge even with @DeniseWorthen's mods. But, it is bit-for-bit after the cgrid merge with and without @DeniseWorthen's mods. So that is almost the opposite! I've tried to do some additional testing and still don't have a good sense why the results are turning out this way. It could be that the mods don't play a role in the 9km regional Arctic grid, since it's a regular 9km x 9km grid. It could be that the "S" and "F" averaging is bit-for-bit on a regular grid and that the haloupdates play no role. The slightly more concerning thing is that I haven't figured out yet why the results are different before and after the cgrid update. I'll either have to get into a debugging exercise or just accept it. My main worry is that something is not quite right for the 9km configuration in particular. I will probably spend a bit more time debugging this. But separately, I think it's good news the f19_g16 case in RASM is bit-for-bit once we include @DeniseWorthen's mods. Further confirmation that we understand the differences in general.
I have things sorted out in RASM. I am able to produce bit-for-bit before and after the C-grid merge. We are using eap (not evp), so that's why initial results were odd. I needed to modify the eap grid_average_X2Y from "S" to "F" to get uocn/vocn/ss_tltx/ss_tlty to be consistent with the prior implementation. I will create a PR in the next few days that will update the mct/cesm1 cap as well as a few other things in CICE that I (and @DeniseWorthen) found. There is nothing different relative to what @DeniseWorthen found. I think this is all good. I will leave the grid_average_X2Y setting of ocean velocity and surface tilt as "S". This will not be bit-for-bit with the pre C-grid version. If anyone feels this should be an "F", let me know and we can discuss further. @dabail10, if you need some help with CESM, let me know.
I'll email you separately with the status of the CESM sandbox. I am leaving on vacation to Hawaii from June 1-8. I'll be back in the office on the 9th.
Sounds good @apcraig. Thanks. I also will be out June 3-June 10 but if a PR is open, I will try to review prior to leaving.
Denise, can I get your CAP changes to get running in UFS? I have a feeling I am missing something in the CESM cap. Dave
@dabail10 Not quite clear on your question. Are you asking for my branch? This is what I have relative to Consortium main for the fixes I found: https://github.com/DeniseWorthen/CICE/tree/feature/addCgridfixes. Let me know if you're looking for something else.
That's it! My cap is aborting with a weird NUOPC error. I think it is some CESM specific things. However, I wanted to double check with yours.
I'm doing more testing and am having some issues with the "new" "uarea, tarea, uarear, tarear" haloupdates in ice_grid.F90. Those halo updates were removed a while ago to fix a bit-for-bit problem with different land block elimination situations. I think the issue is that for tripole grids, we really do need to update the tripole seam for those arrays. Right now, we compute the values locally. @DeniseWorthen, is that what you think too? Are you certain those halo updates are needed to get bit-for-bit? It looks to me like the global_scatter of the grid lengths, dxt, dxu, etc, should address the tripole seam. And then we should get the correct areas on the seam without a halo update. I'd like to understand the situation better and why those haloupdates are needed. Thoughts?
I've done more testing; the haloupdate of tarea, uarea, tarear, uarear in ice_grid.F90 that @DeniseWorthen proposes causes problems in cases where we have different decompositions with land block elimination. That's a rare situation, but something that was fixed a few months ago. And we added several tests that explicitly test this situation. The issue is that halo data is required for a few fields (uarea, tarea) where there may not be a neighbor block due to land block elimination. So, in those cases, we explicitly calculate the field on the entire block including the halo and avoid the halo update. If we turn on the halo update again, it sets the halo to values of 1.0 where there is no neighbor. And that creates differences with different decompositions. All of my tests in RASM and in standalone CICE, including tx1, which is a tripole grid, work correctly without the haloupdate of tarea, uarea, tarear, uarear. @DeniseWorthen, is the UFS grid tripole or tripoleT? Could you email me your UFS ice_in namelist? My guess is that UFS changes answers when the haloupdates are excluded because there is an error in the baseline solution. As I mentioned above, when I add the haloupdate in the CICE test suite, 99% of the results are bit-for-bit, but a few are not and they fail in a way I expect because of the interaction of the land block elimination and the halo. I think that is probably happening in UFS. There are a few ways to confirm. If we could look at the first subcycle where things diverge in UFS, we would probably see a difference on a single or very limited number of block edges. Over the subcycling, the error grows and expands, but it may still be possible to see the source of a difference after one model timestep in history files, especially, if, for instance, we just ran a few timesteps with ndte=1 or 2. Another way to check would be to run a test with
in ice_in with the haloupdates included. This will stop all land block elimination. It should be bit-for-bit with whatever distribution_wght (latitude?) is used by default, but my guess is that it will not be. Then a followup test would be to remove the new haloupdates and run two more tests with the two values of distribution_wght. I think those will be bit-for-bit. That would demonstrate that an error that existed in the baseline UFS simulation is fixed. It means we understand why these haloupdates are needed in UFS to get bit-for-bit and that we don't want to include them. @DeniseWorthen, if you want me to help test any of this, let me know. I can work on cheyenne or hera, just need a sandbox and instructions how to setup the test case.
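The land-block-elimination interaction described above can be mimicked with a toy example. This is a sketch under stated assumptions (the 1.0 fill value comes from the comment above; the area formula and function names are hypothetical), not CICE's actual halo machinery.

```python
# Toy model of a halo cell for a grid-area field. With land block
# elimination, the neighbor block may not exist; a generic halo update then
# fills the halo with 1.0, while computing the field locally over the full
# block (halo included) is independent of which blocks were eliminated.

def cell_area(i):
    # hypothetical per-cell area, computable from grid info anywhere
    return 100.0 + 0.5 * i

def halo_via_update(neighbor_exists, i):
    # halo value delivered by a halo update; 1.0 placeholder if no neighbor
    return cell_area(i) if neighbor_exists else 1.0

def halo_via_local_compute(i):
    # halo value computed locally, decomposition-independent
    return cell_area(i)

# Same global halo cell under two decompositions of the same grid:
kept       = halo_via_update(True, 7)   # neighbor block was kept
eliminated = halo_via_update(False, 7)  # neighbor block was all land
local      = halo_via_local_compute(7)
```

With the halo update on, the two decompositions disagree (103.5 vs the 1.0 placeholder); with local computation, every decomposition yields the same halo value, which is the bit-for-bit property the existing tests protect.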
Thanks for all that info @apcraig. I'm not going to have time to do anything with this before leaving, so if you could do some more testing that would be appreciated. I have a sandbox on cheyenne and I'll send you an email with how I build and run using this. That will have our ice_in (short answer, we use ns_boundary_type = 'tripole' and distribution_wght = 'latitude').
I ran 4 tests of UFS on cheyenne. Combinations were with the uarea, tarea, uarear, tarear haloupdate in ice_grid.F90 on and off plus with distribution_wght = latitude and blockall. As I expected, three of the solutions are bit-for-bit. The one that isn't is haloupdate "on" and distribution_wght=latitude. Apparently, that is the solution that reproduces the baseline run bit-for-bit. That solution has a bug in it. It should be bit-for-bit with "blockall" but isn't. So, I plan a PR that excludes the haloupdate changes in ice_grid.F90. So there are potentially two changes that are not bit-for-bit in UFS
I hope that makes sense. You can see the 4 cases here,
rundir.base is the case that's not bit-for-bit with the others.
Thanks for this work, @apcraig. Something I don't understand is why the differences between rundir.base and any of the other three (say rundir.base.blockall) show a coherent pattern; they don't just show a "noise" difference between the two runs. So, for example, this is the same difference field I was making above, but it no longer has any scaling factor applied (I had been using 1e14).
I think the question of difference patterns and coherency is a relatively broad question. In these cases, the error is going to first be created on the halo at probably just a few points. That error will be bigger than roundoff, but probably not climate changing. Then it will propagate into the solution overall. In my experience, even if a roundoff error is introduced into a solution, the difference isn't usually noise. The solution is likely to be the same climate, but it's a slightly different version of the same climate with coherent "weather" features that are now shifted in time or space. The difference will look coherent. If I think about differencing two solutions that have a roundoff difference introduced at 1 gridcell at 1 point in time, then I think that might look like noise propagating out from the difference for the first few timesteps. After a few hours/days/weeks/months, the solutions are similar but have coherent "weather" structures displaced in time/space/strength, so that difference is coherent. The question is what does the difference look like in between. I would argue the difference is likely to be coherent fairly quickly. I'm not sure at what timestep the difference above was generated. In this case, the error introduced is being reinforced at the same location at each timestep. And these runs were based on ice_in.ori, so full subcycling. I think what we'd really need to do is run with minimal subcycling and look at the difference after one timestep to see what the initial perturbation looks like. But overall, I'm not surprised the difference is not simply noise. But ultimately, I believe the difference will still produce the same climate.
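The "same climate, different weather" argument can be illustrated with any chaotic iteration; the logistic map below is a generic stand-in with no connection to CICE's actual dynamics.

```python
# A perturbation far below physical significance grows to O(1) under a
# chaotic iteration, yet the trajectory stays bounded: the two solutions
# decorrelate but remain the "same climate". Illustrative only.

def iterate(x, n, r=3.9):
    # chaotic logistic map as a stand-in for a nonlinear model timestep
    for _ in range(n):
        x = r * x * (1.0 - x)
    return x

a = iterate(0.3, 100)
b = iterate(0.3 + 1e-15, 100)  # tiny perturbation at step 0
# a and b both stay in [0, 1], but their difference is far above roundoff
```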
I can't replicate your results for rundir.base vs rundir.base.blockall. Both of these are using the same executable (using the addCgridfixes branch, which should match your "halo on" case). The only difference is the distribution_wght switched to blockall. I find they are B4B; I tested both ndte=2 and our original ndte=120. My run directory is (The most recent figure I posted was a difference in a field imported to the mediator at the end of the first timestep).
I can't explain why the solutions don't diverge with ndte=2. I would expect they would. I assume if you run with ndte=120, you are able to reproduce my solution? In the images near the top of this thread, you show differences with ndte=2, but now you are not seeing those. What's the difference in those runs? I guess the first set of plots above also includes the difference in "S" vs "F"?
#726 is my PR. This includes one other fix that has not yet been tested in UFS. That is a fix to the averaging of ss_tlt. I believe this will also be answer changing and is also a bug fix.
Were you testing your PR #726 in your cgridtests/cgrid run directories? That is, was it included in what you compiled?
The UFS tests I did were just with the branch you pointed me to. I made no code changes except to turn off the 4 haloupdates in ice_grid.F90. The main point of my UFS testing was to try to prove that the baseline solution had an error that was fixed by not including the 4 haloupdates in ice_grid.F90. #726 was tested in RASM and in standalone CICE and I think is consistent with the UFS branch except for the update to the ss_tlt averaging.
The very top figures are for my initial test of pre-cgrid vs post-cgrid code base. So they included the original code to use "S" for the X2Y average of ocean velocities and the tilts. The figures I showed today are for the ufs branch I pointed you to, so it is the branch which regains B4B for UFS (by adding the halo updates, switching S->F, and changing the haloUpdate type to vector in ice_import_export). It sounds like it should be identical to what you tested for the cases where you left the halos "on". My rundir.base does reproduce your rundir.base. But my rundir.base.blockall does not reproduce your rundir.base.blockall.
The executable in your rundir.base.blockall seems to be linked to the wrong place:
Yes, you are probably right, I copied in your rundir and updated the ice_in and job script, but may have forgotten to change the binary link. Let me redo these tests. That suggests my original conclusions were incorrect and I need to do a bit more digging.
I understand the issue now. I ran two side-by-side tests with UFS, with the halo on and off.
These have snapshot ice output, ndte=2, diagfreq=1. There is a tool on cheyenne, cprnc, that does quick comparisons of two netcdf files. If you do
you'll see that the ice history file is bit-for-bit after the first timestep and only 3 fields are not bit-for-bit after the 2nd timestep. And only 3 fields are not bit-for-bit in the first cpl.hi.ice file, Fioi_taux/tauy and Si_vsno. Si_vsno is roundoff different. taux/tauy are a bit larger. If you look at the Fioi_taux/tauy after that first timestep, the difference is only at the tripole seam. I modified the code to write out the tarea and uarea at the last few j gridpoints in ice_grid.F90,
and that difference shows uarea is different on the tripole seam in the halo at roundoff level, i.e.,
The two cases differ by a tiny amount. In one case, the areas are computed independently in the halo. In the other case, the areas are copied across the tripole seam. While the results should be the same, they aren't. There is a roundoff difference in computing uarea on different sides of the seam. This seems to primarily impact the strocn[x,y]T output field, which is diagnosed in ice_dyn_evp.F90 and ultimately averaged to the T grid,
I believe this U2T "F" averaging is where the difference really comes in and without the coupling, CICE might remain bit-for-bit. It is bit-for-bit in CICE standalone with a tripole grid. So, I think I understand things now, and this explanation makes more sense than the land block elimination interaction I suggested earlier, which was just wrong. The question is what to do. This is basically a roundoff difference in some grid parameters across the tripole seam. The science won't change. But, we really do want the tripole seam to be bit-for-bit symmetric, so I think this is something that needs to be fixed. At the same time, we do NOT want to update halos in tarea/uarea in general, as this breaks other features. We just want the tripole seam to be consistent. I think we need to add a "haloUpdateTripole" that just does the tripole seam and we need to call it for the uarea and tarea fields in ice_grid. It could be that there are other asymmetries in the tripole seam implementation that also need to be fixed, but we don't see those at the moment, so this is a start. It would be interesting to actually check the results on the tripole seam and verify they truly are symmetric in a proper simulation. But I'm not going to do that now. I will implement the proposed changes, test, and update the PR.
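The seam mismatch described above boils down to rounding order in IEEE-754 arithmetic. A minimal demonstration (using addition rather than CICE's actual area formulas; only the rounding-order effect is the point):

```python
# Mathematically identical expressions, evaluated in a different association
# order, can differ in the last bit: the same effect as computing uarea
# independently on each side of the seam vs copying it across.
computed_locally = (0.1 + 0.2) + 0.3  # one evaluation order
copied_across    = 0.1 + (0.2 + 0.3)  # the other; exact math says equal
difference = computed_locally - copied_across  # one ulp, not zero
```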
This is really a nice verification of what is happening. Thanks!
@apcraig @dabail10 I've merged the latest main and tested again. If I retain the "F" setting for the ocean velocities and tilts, I can reproduce our current baselines (as expected). That's a nice clean result. Dave, what is the status of the wave-coupling PR? From my perspective, I would prefer to update NOAA-EMC with just the changes to the NUOPC cap for the C-grid rather than combining with the wave-coupling. The changes required to main for the nuopc/cmeps are pretty minor; these are the changes I've tested: https://github.com/CICE-Consortium/CICE/compare/main...DeniseWorthen:feature/addCgridfixes?expand=1
@DeniseWorthen, thanks for testing the latest. I also agree that it's probably good to update the cap, so it's working properly for the current version. Then as a separate step, add new coupling features. One option would be for @DeniseWorthen to create a PR? Then #717 could be closed and @dabail10 could update his sandbox, verify it works as is, then work from that? Or #717 could morph into just the needed changes. Ultimately, I leave it up to @dabail10, @DeniseWorthen to decide how to best proceed since this is primarily related to the nuopc/cmeps cap. Let me know if I can help with anything. Thanks.
I'm getting a failure when trying to run w/ the c-grid turned on in debug mode. I'm going to need a little time to track this down before I can make any PR. The failure on hera.intel is
@DeniseWorthen, that's a bummer. But you've always been good at finding these problems. We do test a handful of cases with debug flags on in the test suite with c-grid, but maybe the flags you use are more comprehensive. I do not believe we use init=nan because some parts of the code fail on that. We've made some efforts to debug those issues, but not at high enough priority. Let me know if I can help with anything.
@apcraig For intel, we do compile w/
@DeniseWorthen, that's great. Thanks for your efforts. If you want to create a PR, I'll merge it. If you prefer I create a PR, I'm also happy to do that.
@dabail10 Does it work for you if we merge this first and then you update w/ the changes for wave-ice coupling next?
That works.
I think we addressed the issues in this PR with #728. Will close now. Can reopen if I'm missing something.
We are trying to validate the C-grid code in UFS, CESM, and RASM. Some initial discussion is here, #717. In summary, UFS is not bit-for-bit when updating to the C-grid version. We suspect this is due to changes in the coupling/forcing averaging/handling. We tested changing the uocn/vocn/ss_tltx/ss_tlty from an "S" averaging to an "F" averaging in UFS and this did not fix the problem, although it may play a role. Additional debugging needs to be done. @DeniseWorthen is working on UFS, @dabail10 is working on CESM, @apcraig is working on RASM. The UFS results show largest differences after one coupling period on the tripole seam, so tripole vs displaced pole could play a role, and we need to consider that.
As a next step, the CICE in RASM will be updated to pre and post C-grid versions and run side-by-side to check results. UFS will be run with ndte=1 to more cleanly assess differences after one timestep. CESM is planning to create pre and post C-grid branches for validation as well.
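The side-by-side checks planned above amount to a field-by-field bit-for-bit comparison, in the spirit of the cprnc tool mentioned earlier in the thread. A minimal sketch (field names and data are hypothetical; a real check would read the arrays from the netCDF history files):

```python
import numpy as np

def b4b_report(fields_a, fields_b):
    """Return the names of fields that are not bit-for-bit identical."""
    return [name for name in fields_a
            if not np.array_equal(fields_a[name], fields_b[name])]

# Example: one field differs by a single last-bit change in one cell.
run_a = {"uocn": np.zeros(4), "vocn": np.ones(4)}
run_b = {"uocn": np.zeros(4), "vocn": np.ones(4)}
run_b["vocn"][0] = 1.0 + 1e-15  # roundoff-level difference, still not B4B
```

Exact equality (rather than a tolerance) is the point here: a roundoff-level difference in a single cell is enough to flag a run as not bit-for-bit.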