
Figure out non-BFB behavior from ocean split-explicit time integration driver cleanup #4152

Closed
jonbob opened this issue Mar 10, 2021 · 12 comments · Fixed by #4160
Labels
bug, Critical, mpas-ocean, non-BFB (PR makes roundoff changes to answers)

Comments

@jonbob
Contributor

jonbob commented Mar 10, 2021

PR #4146 resulted in unexplained non-BFB results for some, but not all, configurations using the Intel compiler, in both optimized and debug mode. Because the impacted configurations include the current WaterCycle default, it is critical that we understand the cause of the answer changes before moving forward.

@mattdturner
Contributor

mattdturner commented Mar 10, 2021

I ran SMS_D_Ld1.ne30_r05_oECv3.A_WCYCL1850.chrysalis_intel with optimization disabled, using the following patch:

Patch to remove optimization
diff --git a/cime_config/machines/config_compilers.xml b/cime_config/machines/config_compilers.xml
index dc38457fd..851e5e5db 100644
--- a/cime_config/machines/config_compilers.xml
+++ b/cime_config/machines/config_compilers.xml
@@ -835,6 +835,7 @@ flags should be captured within MPAS CMake files.
     <append COMP_NAME="gptl">-DHAVE_SLASHPROC</append>
   </CPPDEFS>
   <FFLAGS>
+    <base>-O0</base>
     <append DEBUG="FALSE">-O2 -debug minimal -qno-opt-dynamic-align</append>
   </FFLAGS>
   <PIO_FILESYSTEM_HINTS>gpfs</PIO_FILESYSTEM_HINTS>

I first generated a baseline for commit hash dd5895d0a, and then ran a test with hash 51e9f38ef that compared against the baseline. The comparison FAILed, which surprised me since all MPAS-O standalone tests with Intel + -O0 were BFB.

I confirmed (by looking at the E3SM bldlog) that the model was compiled without optimization.

I have my own local branch where I implemented the changes in MPAS-Dev/MPAS-Model#781 incrementally, so I can start bisecting to see if I can narrow down the source of the non-BFB behavior.

@xylar
Contributor

xylar commented Mar 10, 2021

The comparison FAILed, which surprised me since all MPAS-O standalone tests with Intel + -O0 were BFB.

@mattdturner, the oEC60to30v3 mesh is hefty enough that we don't include it in our standalone test suite. It seems like we do need to include a QU240 test case that has exactly the same namelist flags as oEC60to30v3 in E3SM to catch this kind of problem earlier.

@mark-petersen
Contributor

@mattdturner and @philipwjones here is a standalone setup that reproduces this error. There is an EC60to30 BFB mismatch with Intel optimized, but not with Intel debug, GNU optimized, or GNU debug.

# base case, before SE update, bcde86f8
cd /lustre/scratch3/turquoise/mpeterse/runs/nightly/ocean_model_210309_bcde86f8_gr_intel-mpi_19_openmp_b4_seUpdate/ocean/global_ocean/QU240/performance_test
./run.py
# after SE update, comparison, f54e7ede
cd  /lustre/scratch3/turquoise/mpeterse/runs/nightly/ocean_model_210309_f54e7ede_gr_intel-mpi_19_openmp_after_seUpdate_compare/ocean/global_ocean/QU240/performance_test
./run.py

You will see a mismatch like this on the second one:

 * Running forward
 - Complete
Beginning variable comparisons for all time levels of field 'temperature'. Note any time levels reported are 0-based...
0:  l1: 3.29513470481893e-07  l2: 3.81848138860394e-03  linf: 9.99365096034932e-04

Could one of you copy those two directories and verify you get the mismatch? They point to EC initial conditions in my directories, but it should all be accessible. Note that you need to edit run.py in the second one with the path of the first for the comparison to work correctly. There is a QU240 in the path name right now, but it is actually testing the EC60to30.

If so, it would be easy to start with the first commit bcde86f and make the incremental changes with vimdiff: first the lines that are only spacing, then test and commit; then the lines that only remove mesh variables. Those should all still match BFB, and we can see what is left.

@mattdturner
Contributor

This doesn't solve the BFB issue, but refBottomDepth should be removed from the following declaration, because we are getting that pointer from ocn_mesh:

      real (kind=RKIND), dimension(:), pointer :: &
         refBottomDepth,         &! reference bottom depth
         normalBarotropicVelocity ! normal barotropic velocity

As the code currently stands, refBottomDepth will be an unassociated pointer.
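For reference, a minimal sketch of what the suggestion above would look like - dropping refBottomDepth from the local pointer declaration so that the use-associated ocn_mesh pointer is the one referenced (this is only an illustration of the suggestion, not the fix that was eventually merged):

      real (kind=RKIND), dimension(:), pointer :: &
         normalBarotropicVelocity ! normal barotropic velocity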

I'm assuming the tests in the MPAS-O nightly suite only run with config_filter_btr_mode = .false.? That's the only reason I can think of why this wouldn't have been an issue in testing.

@philipwjones
Contributor

@mark-petersen - I forgot to respond earlier, but yes, I verify that your configurations above produce a mismatch, and also that QU240 (even with mesh scaling) does not. I've gone through a couple of the mods but haven't yet found the change that triggers this.

@mattdturner
Contributor

If I apply the following patch to components/mpas-source then the results are BFB for SMS_Ln6.ne30_r05_oECv3.A_WCYCL1850.chrysalis_intel when compared to a baseline that I generated.

Baseline

  • generated just before the sedriver branch was merged
  • E3SM hash: dd5895d0a
  • mpas-source hash: e72f9e3f

Test

  • run after applying the patch, just after the sedriver branch was merged
  • E3SM hash: aabd98361
  • mpas-source hash: ca57000e (again, patch applied to this hash)
SE Cleanup BFB Patch
diff --git a/src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F b/src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F
index 9cd0fddd..d33fb20b 100644
--- a/src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F
+++ b/src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F
@@ -1396,11 +1396,8 @@ module ocn_time_integration_split
                ! velocity for normalVelocityCorrectionection is
                ! normalBarotropicVelocity +
                ! normalBaroclinicVelocity + uBolus
-               do k=1,nVertLevels
-                  uTemp(k) = normalBarotropicVelocityNew(iEdge) + &
-                             normalBaroclinicVelocityNew(k,iEdge) + &
-                             normalGMBolusVelocity(k,iEdge)
-               end do
+                  uTemp(:) = normalBarotropicVelocityNew(iEdge) + normalBaroclinicVelocityNew(:,iEdge) &
+                           + normalGMBolusVelocity(:,iEdge)
 
                ! thicknessSum is initialized outside the loop because
                ! on land boundaries maxLevelEdgeTop=0, but I want to
@@ -2026,6 +2023,7 @@ module ocn_time_integration_split
 
       type (mpas_pool_type), pointer :: &
          statePool,      &! structure containing model state
+         meshPool,       &! structure containing mesh fields
          tracersPool      ! structure containing tracer fields
 
       integer ::         &
@@ -2034,6 +2032,9 @@ module ocn_time_integration_split
          ierr,           &! local error flag
          cell1, cell2     ! neighbor cell indices across edge
 
+      integer, dimension(:), pointer :: maxLevelEdgeTop
+      integer, dimension(:,:), pointer :: cellsOnEdge
+
       real (kind=RKIND) ::       &
          normalThicknessFluxSum, &! vertical sum of thick flux
          layerThicknessSum,      &! vertical sum of layer thickness
@@ -2057,6 +2058,8 @@ module ocn_time_integration_split
          remainder,          &! remaining time after interval division
          zeroInterval         ! zero timestep for comparing remainder
 
+      integer, pointer :: nVertLevels, nCells, nEdges
+
       integer (kind=I8KIND) :: nBtrSubcyclesI8
 
       ! End preamble
@@ -2137,8 +2140,12 @@ module ocn_time_integration_split
 
          block => domain % blocklist
          call mpas_pool_get_subpool(block%structs, 'state', statePool)
-
          call mpas_pool_get_subpool(statePool, 'tracers', tracersPool)
+         call mpas_pool_get_subpool(block%structs, 'mesh', meshPool)
+
+         call mpas_pool_get_dimension(block % dimensions, 'nVertLevels', nVertLevels)
+         call mpas_pool_get_dimension(block % dimensions, 'nCells', nCells)
+         call mpas_pool_get_dimension(block % dimensions, 'nEdges', nEdges)
 
          call mpas_pool_get_array(statePool, 'layerThickness', &
                                               layerThickness, 1)
@@ -2149,6 +2156,10 @@ module ocn_time_integration_split
          call mpas_pool_get_array(statePool, 'normalBaroclinicVelocity', &
                                               normalBaroclinicVelocity, 1)
 
+         call mpas_pool_get_array(meshPool, 'refBottomDepth', refBottomDepth)
+         call mpas_pool_get_array(meshPool, 'cellsOnEdge', cellsOnEdge)
+         call mpas_pool_get_array(meshPool, 'maxLevelEdgeTop', maxLevelEdgeTop)
+
          !*** Compute barotropic velocity at first timestep
          !*** This is only done upon start-up.
          if (unsplit) then
@@ -2162,12 +2173,12 @@ module ocn_time_integration_split
          else ! split explicit
 
             if (config_filter_btr_mode) then
-               do iCell = 1, nCellsAll
+               do iCell = 1, nCells
                   layerThickness(1,iCell) = refBottomDepth(1)
                enddo
             endif
 
-            do iEdge = 1, nEdgesAll
+            do iEdge = 1, nEdges
                cell1 = cellsOnEdge(1,iEdge)
                cell2 = cellsOnEdge(2,iEdge)
                kmax  = maxLevelEdgeTop(iEdge)

From the patch, I would say that the uTemp differences are almost certainly due to additional optimization being done by the compiler:

-               do k=1,nVertLevels
-                  uTemp(k) = normalBarotropicVelocityNew(iEdge) + &
-                             normalBaroclinicVelocityNew(k,iEdge) + &
-                             normalGMBolusVelocity(k,iEdge)
-               end do
+                  uTemp(:) = normalBarotropicVelocityNew(iEdge) + normalBaroclinicVelocityNew(:,iEdge) &
+                           + normalGMBolusVelocity(:,iEdge)

The rest of the changes in the patch remove the use of ocn_mesh in the initialization routine for split explicit. I'm not sure yet why this would cause non-BFB results.

I also can't yet guarantee that the patch is the minimal set of changes necessary to get BFB.

@philipwjones
Contributor

I can confirm that this fix works in my case too. I also did a comparison of the two forms and they are within roundoff at every step - I had to push the absolute tolerance down to 10^-18 to pick up any differences between the two forms, since uTemp is typically around 10^-3 (so a relative difference of roughly 10^-15, i.e. near double-precision roundoff). Still kinda disturbing that this has such a large impact on final results though - this model is extremely sensitive to velocity changes.

@philipwjones
Contributor

You can actually get bit-for-bit too if you put parentheses around the last two (k-dependent) terms. So it looks like, in the array-syntax form, the compiler is actually pulling out the k-independent first term and doing the sum of the k-dependent terms separately.
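For clarity, a sketch of the parenthesized loop form being described here, with the grouping inferred from the comment above (this variant was not the change that was ultimately merged; see below):

      do k = 1, nVertLevels
         uTemp(k) = normalBarotropicVelocityNew(iEdge) + &
                    ( normalBaroclinicVelocityNew(k,iEdge) + &
                      normalGMBolusVelocity(k,iEdge) )
      end do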

@mattdturner
Contributor

@philipwjones The init portion being unable to use ocn_mesh makes sense, since ocn_time_integration_split_init is called before ocn_meshCreate. It seems odd, though, that using ocn_mesh in init didn't cause any problems in the standalone testing or lower resolution E3SM testing. Unless all of the failing tests had config_do_restart = .true., and all of the passing tests had config_do_restart = .false....
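A rough sketch of the calling order being described here (the driver code is paraphrased and the argument lists are assumed for illustration, not the actual interfaces):

      call ocn_time_integration_split_init(domain)  ! runs first: ocn_mesh is not yet populated
      call ocn_meshCreate(domain)                   ! ocn_mesh pointers (refBottomDepth, etc.) associated here
      ! => any ocn_mesh field referenced inside the init routine is still an
      !    unassociated pointer at that point, hence reading from meshPool instead.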

Should we create a PR to apply the patch to init, and add the parentheses that you mentioned for the uTemp calculation?

@philipwjones
Contributor

@mattdturner - yeah, I couldn't figure out how the init worked either, so definitely a PR for that, and we may as well make the paren change to ensure b4b. There's probably a way we could shift the init call in the forward driver, but for now it's easier to just use the pool. At some point further down the line, we need to work out the dependencies during init.

@mattdturner
Contributor

OK, I'll go ahead and create a PR for those changes.

@mattdturner
Contributor

In my testing of the PR (not yet created), it appears that the parentheses on the uTemp calculation only allow some of the tests to be BFB. If I use the previous uTemp calculation (i.e., uTemp(:) = ...), then everything is BFB. Do you have a preference between using the vector notation (BFB for all tests) or doing the calculation within the k loop (BFB for only some tests)?

@philipwjones
Contributor

Too bad... I guess go back to the array-syntax version if it preserves BFB for all tests.

mattdturner added a commit to MPAS-Dev/MPAS-Model that referenced this issue Mar 16, 2021
This PR reverts a subset of the changes introduced in #781 that caused
non-BFB results for a few of the tests in E3SM.

See discussion E3SM-Project/E3SM#4152 for more information.

Changes include:

- No longer use ocn_mesh in ocn_time_integration_split_init. This
routine is run before ocn_meshCreate is run, so the variables in
ocn_mesh would be undefined. I'm not sure why previous testing didn't
catch this.
- Revert one of the uTemp calculations to the old array syntax
notation
jonbob added a commit that referenced this issue Mar 16, 2021
Update mpas-source: split explicit bugfix

Brings in a new mpas-source submodule with changes only to the ocean
core. It reverts a subset of the changes introduced in #4146 that caused
non-BFB results for a few of the tests in E3SM.

See MPAS-Dev/MPAS-Model#826

Fixes #4152

[non-BFB] for intel tests using the oECv3 mesh
jonbob closed this as completed in eb4e27e on Mar 17, 2021
mark-petersen pushed a commit to mark-petersen/MPAS-Model that referenced this issue Apr 12, 2021
…n/develop

This PR reverts a subset of the changes introduced in MPAS-Dev#781 that caused
non-BFB results for a few of the tests in E3SM.

See discussion E3SM-Project/E3SM#4152 for more information.

Changes include:

- No longer use ocn_mesh in ocn_time_integration_split_init. This
routine is run before ocn_meshCreate is run, so the variables in
ocn_mesh would be undefined. I'm not sure why previous testing didn't
catch this.
- Revert one of the uTemp calculations to the old array syntax
notation
jonbob pushed a commit that referenced this issue Jun 22, 2021
This PR reverts a subset of the changes introduced in #781 that caused
non-BFB results for a few of the tests in E3SM.

See discussion #4152 for more information.

Changes include:

- No longer use ocn_mesh in ocn_time_integration_split_init. This
routine is run before ocn_meshCreate is run, so the variables in
ocn_mesh would be undefined. I'm not sure why previous testing didn't
catch this.
- Revert one of the uTemp calculations to the old array syntax
notation