Description
We are going to merge PR #278 and PR #252. Several issues remain outstanding; the summary below is largely copied from the end of #252.
Let me summarize where we are.
With evp_kernel_ver=0, results are bit-for-bit for most tests against the current master. This comes from running the full test suites on gordon for 4 compilers. A subset of the box tests are NOT bit-for-bit on 3 of the 4 compilers. Rerunning the failed box tests with the debug flag (reduced optimization and run-time checks) on both master and this PR gives bit-for-bit identical answers. It seems the answer changes in the box tests are caused by compiler optimization reacting to the code changes. This might be associated with the evp kernel changes (although @mhrib makes a case that it shouldn't be) or with some of the code cleanup. We could look into this further or we could accept it. Personally, I am comfortable with this outcome as it stands. I believe we've shown that the answers differ at roundoff level (see above gbox128 diff) as a result of compiler optimization, and that we can make them bit-for-bit if we reduce compiler optimization. Based on these results, I think we could merge this PR. evp_kernel_ver=0 will be the default setting.
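For anyone who wants to check the bit-for-bit versus roundoff distinction by hand, here is a minimal Python sketch. It is not the comparison used by the CICE test scripts; `compare_fields` and its result keys are hypothetical names, and it assumes the two fields have already been read into flat lists of double-precision values.

```python
import struct


def bits(x: float) -> bytes:
    """Return the IEEE-754 double bit pattern of x as 8 bytes."""
    return struct.pack(">d", x)


def compare_fields(baseline, candidate):
    """Compare two equal-length sequences of floats (hypothetical helper).

    Counts the values whose bit patterns differ and records the largest
    relative difference among the values that differ.
    """
    if len(baseline) != len(candidate):
        raise ValueError("fields have different lengths")
    n_diff = 0
    max_rel = 0.0
    for a, b in zip(baseline, candidate):
        if bits(a) == bits(b):
            continue  # identical bit pattern
        n_diff += 1
        scale = max(abs(a), abs(b))
        if scale > 0.0:
            max_rel = max(max_rel, abs(a - b) / scale)
    return {
        "bit_for_bit": n_diff == 0,
        "n_diff": n_diff,
        "max_rel_diff": max_rel,
    }


# Illustrative values only, not model output: 0.1 + 0.2 and 0.3 differ
# in the last bits of the double representation.
result = compare_fields([1.0, 0.3], [1.0, 0.1 + 0.2])
print(result["bit_for_bit"], result["n_diff"], result["max_rel_diff"])
```

A maximum relative difference near double-precision machine epsilon (about 2.2e-16) after a single step is consistent with roundoff; how far such differences grow over a longer run is a separate question that this sketch does not address.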
Separately, there is an effort to test and validate evp_kernel_ver=2. The same test suite on gordon was run with the new kernel turned on. Results can be found at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks, hash aa6de33...+evpk=2. Three to four tests fail on each compiler, and they are the same tests across the compilers. Looking at the intel results, https://github.com/CICE-Consortium/Test-Results/wiki/aa6de33f19.gordon.pgi.190128.235649, there are four failures.
- restart gbox128 4x2. This test runs but fails to restart exactly.
- restart gx1 40x4 droundrobin medium. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
- restart gx3 16x2x5x10x20 drakeX2. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
- restart tx1 40x4 dsectrobin medium. This test fails gracefully in the evp kernel. tx1 is not supported yet.
Again, many tests passed, but these 4 failures need to be debugged. In addition, the qc test relies on the gx1 configuration, so the qc testing comparing evp_kernel_ver=2 to 0 could not be done.
So, the outstanding tasks are:
- debug the 4 failures noted above
- run the qc test comparing evp_kernel_ver=0 to evp_kernel_ver=2. This requires gx1 (one of the failing tests)
- update documentation
- change evp_kernel_ver variable to kevp_kernel
- produce and document timing information comparing evp_kernel_ver=0 and 2.
- add evp_kernel_ver=2 tests to the test suite
- maybe do a little cleanup on ice_dyn_evp_1d.F90 to make the code a little more readable (breaks between subroutines and such)