evp kernel version 2 testing and validation #279

@apcraig

We are going to merge PR #278, PR #252. There are several outstanding issues, basically copied from the end of #252:


Let me summarize where we are.

With evp_kernel_ver=0, results are bit-for-bit against the current master for most tests. This is from running the full test suites on gordon with 4 compilers. A subset of the box tests is NOT bit-for-bit on 3 of the 4 compilers. Rerunning the failed box tests with the debug flag (reduced optimization and run-time checks) on both master and this PR gives bit-for-bit identical answers. It seems the change in answers in the box tests is caused by some compiler optimization triggered by the code changes. This might be associated with the evp kernel changes (although @mhrib makes a case that it shouldn't be) or it might be associated with some of the code cleanup. We could look into this further or we could accept it. Personally, I am comfortable with this outcome as it stands. I believe we've shown that the answers differ only at roundoff level (see the gbox128 diff earlier in #252) as a result of compiler optimization, and that we can make them bit-for-bit if we reduce compiler optimization. Based on these results, I think we could merge this PR. evp_kernel_ver=0 will be the default setting.
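
As a minimal illustration of why an optimizer can change answers at roundoff level, here is a Python sketch. It is not CICE code, and the function names are hypothetical. It only shows that floating-point addition is not associative, so a build that regroups an expression can differ in the last bits from a build that evaluates it exactly as written.

```python
import math

def sum_as_written(a, b, c):
    # Evaluation order a reduced-optimization build would be expected to keep.
    return (a + b) + c

def sum_reassociated(a, b, c):
    # Mathematically identical regrouping that an optimizer may choose.
    return a + (b + c)

x = sum_as_written(0.1, 0.2, 0.3)
y = sum_reassociated(0.1, 0.2, 0.3)

print(x)                                  # 0.6000000000000001
print(y)                                  # 0.6
print(x == y)                             # False: not bit-for-bit
print(math.isclose(x, y, rel_tol=1e-12))  # True: roundoff-level difference
```

Differences of this kind accumulate over many timesteps, which is consistent with tests that match under the debug flag but diverge slightly at higher optimization.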

Separately, there is an effort to test and validate evp_kernel_ver=2. The same test suite on gordon was run with the new kernel turned on. Results can be found at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks, hash aa6de33...+evpk=2. Three to four tests fail on each compiler, and they are the same tests across the compilers. Looking at the intel results, https://github.com/CICE-Consortium/Test-Results/wiki/aa6de33f19.gordon.pgi.190128.235649, there are four failures.

  • restart gbox128 4x2. This test runs but fails to restart exactly.
  • restart gx1 40x4 droundrobin medium. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
  • restart gx3 16x2x5x10x20 drakeX2. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
  • restart tx1 40x4 dsectrobin medium. This test fails gracefully in the evp kernel. tx1 is not supported yet.
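
To make "fails to restart exactly" concrete, the following Python sketch shows one way a bit-for-bit comparison of two fields could be expressed. It is an assumption-laden illustration, not how the CICE test scripts compare restart files: the function name `first_mismatch` and the sample values are hypothetical, and the fields are plain lists of 64-bit floats.

```python
import struct

def first_mismatch(baseline, restarted):
    """Return the index of the first value whose 64-bit pattern differs, or None."""
    if len(baseline) != len(restarted):
        raise ValueError("fields have different lengths")
    for i, (a, b) in enumerate(zip(baseline, restarted)):
        # Compare raw IEEE 754 bytes so that any last-bit difference is caught.
        if struct.pack("<d", a) != struct.pack("<d", b):
            return i
    return None

# Made-up values standing in for a field from a continuous run and a restarted run.
continuous = [0.25, 1.5, 0.1 + 0.2]
restarted = [0.25, 1.5, 0.3]

print(first_mismatch(continuous, continuous))  # None: exact restart
print(first_mismatch(continuous, restarted))   # 2: differs in the last bits
```

Comparing byte patterns rather than using a tolerance is the point here: an exact restart test passes only when every value is identical, not merely close.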

Again, many tests passed, but these 4 failures need to be debugged. In addition, the qc test relies on the gx1 configuration, so the qc testing comparing evp_kernel_ver=2 to 0 could not be done.

So, the outstanding tasks are:

  • debug the 4 failures noted above
  • run the qc test comparing evp_kernel_ver=0 to evp_kernel_ver=2. This requires gx1 (one of the failing tests)
  • update documentation
  • change evp_kernel_ver variable to kevp_kernel
  • produce and document timing information comparing evp_kernel_ver=0 and 2.
  • add evp_kernel_ver=2 tests to the test suite
  • maybe do a little cleanup on ice_dyn_evp_1d.F90 to make the code a little more readable (breaks between subroutines and such)
