-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Figure out non-BFB behavior from ocean split-explicit time integration driver cleanup #4152
Comments
I ran Patch to remove optimization
I first generated a baseline for commit hash I confirmed (by looking at the E3SM bldlog) that the model was compiled without optimization. I have my own local branch where I implemented the changes in MPAS-Dev/MPAS-Model#781 incrementally, so I can start a bisecting to see if I can narrow down the source of the non-BFB behavior. |
@mattdturner, the oEC60to30v3 mesh is hefty enough that we don't include it in our standalone test suite. It seems like we do need to include a QU240 test case that has exactly the same namelist flags as oEC60to30v3 in E3SM to catch this kind of problem earlier. |
@mattdturner and @philipwjones here is a setup in stand-alone that reproduces this error. There is an EC60to30 bfb mismatch with intel optimized, but not for any of intel debug, gnu opt or gnu debug.
You will see mismatch like this on the second one:
Could one of you copy those two directories and verify you get the mismatch? They point to EC initial conditions in my directories, but it should all be accessible. Note you need to alter run.py in the second one with the path of the first to do the comparison right. There is a QU240 in the path name right now, but it is actually testing the EC60to30. If so, it would be easy to start with the first commit bcde86f and make the incremental changes with vimdiff - first lines that are only spacing, test and commit, then lines that are only removing mesh variables. Those should all still match bfb, and we can see what is left. |
This doesn't solve the BFB issue, but
As the code currently stands, I'm assuming the tests in MPAS-O nightly suite only run with |
@mark-petersen - forgot to respond earlier, but yes, I verify that your configurations above produce a mismatch and also that QU240 (even with mesh scaling) does not. Gone through a couple of the mods but still haven't found the change that triggers this yet. |
If I apply the following patch to Baseline
Test
SE Cleanup BFB Patch
From the patch, I would say that the
The rest of the changes in the patch are removing the use of I also can't yet guarantee that the patch is the minimum amount of changes necessary to get BFB. |
I can confirm that this fix works in my case too. I also did a comparison of the two forms and they are within roundoff at every step - had to push the abs tolerance down to 10^-18 to pick up any differences in the two forms since uTemp is typically around 10^-3. Still kinda disturbing that this has such a large impact on final results though - this model is extremely sensitive to vel changes. |
You can actually get bit-for-bit too if you put parentheses around the second two (k-dependent) terms. So it looks like in the array syntax form, it's actually pulling out the k-independent first term and doing the sum of the k-dependent terms separately. |
@philipwjones The Should we create a PR to apply the patch to |
@mattdturner - yeah, I couldn't figure out how the init worked either so definitely a PR for that and may as well make the paren change to ensure b4b. There's probably a way we could shift the init call in the forward driver but for know it's easier to just use the pool. At some point further down the line, need to work out the dependencies during init. |
OK. I'll go ahead and create a PR for those changes |
In my testing the PR (not yet created), it appears that the parenthees on the |
Too bad...I guess go back to the array syntax version if it preserved bfb for all tests. |
This PR reverts a subset of the changes introduced in #781 that caused non-BFB results for a few of the tests in E3SM. See discussion E3SM-Project/E3SM#4152 for more information. Changes include: - No longer use ocn_mesh in ocn_time_integration_split_init. This routine is run before ocn_meshCreate is run, so the variables in ocn_mesh would be undefined. I'm not sure why previous testing didn't catch this. - Revert one of the uTemp calculations to the old array syntax notation
Update mpas-source: split explicit bugfix Brings in a new mpas-source submodule with changes only to the ocean core. It reverts a subset of the changes introduced in #4146 that caused non-BFB results for a few of the tests in E3SM. See MPAS-Dev/MPAS-Model#826 Fixes #4152 [non-BFB] for intel tests using the oECv3 mesh
…n/develop This PR reverts a subset of the changes introduced in MPAS-Dev#781 that caused non-BFB results for a few of the tests in E3SM. See discussion E3SM-Project/E3SM#4152 for more information. Changes include: - No longer use ocn_mesh in ocn_time_integration_split_init. This routine is run before ocn_meshCreate is run, so the variables in ocn_mesh would be undefined. I'm not sure why previous testing didn't catch this. - Revert one of the uTemp calculations to the old array syntax notation
This PR reverts a subset of the changes introduced in #781 that caused non-BFB results for a few of the tests in E3SM. See discussion #4152 for more information. Changes include: - No longer use ocn_mesh in ocn_time_integration_split_init. This routine is run before ocn_meshCreate is run, so the variables in ocn_mesh would be undefined. I'm not sure why previous testing didn't catch this. - Revert one of the uTemp calculations to the old array syntax notation
PR #4146 resulted in unexplained non-BFB results for some, but not all, configurations using the intel compiler in both optimized and debug mode. Because the impacted configurations include the current WaterCycle default, it is critical we understand the cause of the answer changes before moving past.
The text was updated successfully, but these errors were encountered: