Refactor bit-for-bit diagnostics #300

apcraig · 2019-03-18T02:36:02Z

Refactor bit-for-bit diagnostics, #204

Developer(s): tcraig
Are the code changes bit for bit, different at roundoff level, or more substantial? bit-for-bit, #77addedd, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks.
All results are bit-for-bit but due to some new additions and some refactoring of the tests, some tests show missing regression results.
Does this PR create or have dependencies on Icepack or any other models? Yes, requires latest version of Icepack included in this PR.
Is the documentation being updated with this PR? (Y/N) Y
Other Relevant Details:

Removed all references to cpps DITTO, REPRODUCIBLE, and the old bfbflag code. Added new algorithms to carry out the global sum diagnostics leveraging various implementations. There is NO global gather to generate a bit-for-bit sum. bfbflag is now a character string and the supported options are off, lsum4, lsum8, lsum16, reprosum, and ddpdd. The algorithms are described briefly in the documentation. off==lsum8 and this option does the global sum via a local reduction then a global scalar MPI allreduce. reprosum is the recommended setting for bit-for-bit diagnostics.

This has been implemented primarily in the ice_global_reductions.F90 code. The same algorithms are implemented in the mpi and serial versions and the mpi and serial global reductions are bit-for-bit with each other with bfbflag=reprosum. The implementations work for different pe counts, different blocks sizes, mpi, threading on and off, and serial. Prior implementations only worked on a subset of these conditions.

Some new tests and scripts were added to test the log files. The logbfb test is identical to the smoke test except for comparisons to other cases, the log file is compared instead of the restart. reprosum_suite is a suite of tests that test the log file diagnostics with bfbflag='reprosum'.

The comparebfb.csh script was refactored to simplify the implementation and the log comparison was separated to a different script. In addition, there were some errors introduced in October 2018 that broke some of the comparebfb testing and this has been fixed.

A new feature to allow sleeps between test case build/submission was added to try to manage dependencies between cases better. This is not a perfect solution. The problem is that when one test is comparing it's results to another, the second has to complete after the first is done, but this does not always happen depending on how jobs are flowing thru the queue.

Obsolete cpps were removed and a bug that broke the cpps implementation in the build was fixed.

The decomp test was refactored to be more compatible with the updated comparebfb.csh script. It also now should work with multiple binary files. As a result, the directory structure of the results is slightly different and regression testing with prior versions results in missing data. This will self correct as we test in the future.

Fixed a bug in a few of the diagnostic output fields. This bug created non-reproducible results when running threaded tests. There was an error in zeroing out of the al*_init fields in threaded regions in ice_step_mod.F90.

Added 8 byte integer kinds, needed for the reprosum and ddpdd algorithms. Also use real*16 sums and reduction in the "lsum16" option. PGI does not support this type, so had to add a NO_R16 cpp for pgi builds.

Update travis to support build for c code.

Refactor Macros file to better support serial and parallel builds as well as building c code. Also needed a small update to the Makefile which had "cc" hardwired as the c compiler command.

Refactor poll_queue script and fix --bgen default bug with single tests, #299, #301.

Update job launching to avoid mpi launches with serial code. Also, work towards using serial (vs mpi) builds when possible.

…remove unused cpps

…ame way as mpi mode

duvivier · 2019-03-18T15:29:41Z

Doc looks okay, not sure why Travis CI failed though.

apcraig · 2019-03-18T15:36:21Z

I just relaunched the travis test, looks like it was failing on the input data download. Will monitor it.

duvivier · 2019-03-18T15:45:54Z

Ok. Let me know if there's something wrong with the FTP again. I haven't touched it lately and I don't think Dave has, so hopefully there's not a system issue.

apcraig · 2019-03-18T15:53:35Z

ftp worked this time, but I'm running into another problem that I have to fix. I added a c file and travis doesn't yet know how to compile c. I need to update the travis env. Hope to fix this today.

dabail10

I have not changed anything with the input files. Everything is still there. The code looks fine as far as I can tell.

apcraig · 2019-03-18T17:12:58Z

I am going to refactor the Macros files a bit to better support both serial and parallel compiles as well as compiling c code. More changes coming. Will try to test on most platforms as I go.

eclare108213 · 2019-03-18T17:17:30Z

Thanks @apcraig.
When you do that, could you provide the option for serial runs to be launched without using mpirun, please? thx

apcraig · 2019-03-18T17:28:07Z

@eclare108213, good idea. We have that for a few machines already, will try to extend that. I may need you to do some testing on your local machines. I'll email you when I need some help with that.

…ile and serial/parallel compiles better and cleaner

…irst_suite.ts

…ICE-Consortium#301, single test bgen default use

phil-blain

I'm reviewing for the issues mentioned in #301, #299 and at the end of #281. Everything seems to be resolved on my machine (cesium).
The only thing is I had to add ICE_MACHINE_QSTAT to my env file, it was missing for some reason. I've already added it to my sandbox but haven't issued a PR for that yet ; if you want to add it to this PR, go ahead:
setenv ICE_MACHINE_QSTAT "qstat "
(not sure if the space is needed, some env files have it and some don't...)
It would need to be added also to JF's machine (fram). For now we are just running on our local workstations, not on our clusters. I plan to add machines files for our clusters in the near future.

apcraig · 2019-03-21T15:32:32Z

@phil-blain, great to hear things seem to be working for you. I can add missing ICE_MACHINE_QSTAT settings in the various machine files. If we need to update them later, we can.

apcraig · 2019-03-21T17:57:07Z

@eclare108213 , are you OK with merging this PR?

apcraig · 2019-03-26T17:47:06Z

@eclare108213 , just another ping. I think this PR is ready, but it's big enough that I prefer to have your OK before merging. I will start a round of testing once this is merged, before the automated weekend testing.

eclare108213 · 2019-03-26T18:18:44Z

I have started another set of tests on badger. In the last set, 16 of them failed, but they all seem to be due to machine flakiness, e.g.

badger_intel_decomp_gx3_4x2x25x29x5.bfb/logs/cice.runlog.190320-155405:ORTE has lost communication with a remote daemon.

I suspect this is not a code-generated problem. Will update when test results come in.

eclare108213

Does the cpp SERIAL_REMOVE_MPI mean that we can use a single-source ice_global_reductions.F90 instead of having copies in comm/mpi and comm/serial? I don't understand why mpi stuff is added to comm/serial/ice_global_reductions.F90 -- the original point of having an mpi and a serial version was so that the serial version would be completely mpi-free. Nowadays that's probably not necessary, but I did find it handy on the loft laptop to be able to run quick, serial tests.

Are we allowed to redistribute cicecore/cicedynB/infrastructure/ice_shr_reprosum86.c ?
I don't understand what that code does, or when/how it interacts with CICE. Is that documented somewhere?

apcraig · 2019-03-27T02:26:42Z

@eclare108213 , these are two good questions.

Regarding the SERIAL_REMOVE_MPI, one of the problems with having separate serial and mpi directories is that things get out of sync. Stuff gets added to one, but not the other, and so forth. In addition, for these global sums, it's important (if we want global sums to be reproducible in serial and mpi mode, which they are not with bfbflag='reprosum') that the implementations in serial and mpi mirror each other. So I decided to basically take the mpi version that I implemented first and add the ability to turn off the MPI parts and then drop it into the serial directory with a directive turned on that does just that. The two files are basically identical now which makes it easy to maintain moving forward. It would also be possible to merge the serial and mpi directories if we wanted to and to use the file we have now for mpi and serial mode. So anyway, there were a few reasons why it is not bad that the mpi and serial versions of ice_global_reductions.F90 are similar. A bigger question is whether we want to continue to maintain the serial and mpi directories or whether it's better to have one with MPI cpp'ed out. I noticed a lot of differences in the global_reductions code in the two directories before I refactored them. I wonder if there are other differences (like in handling the tripole halo, etc) that might actually be missing or wrong in mpi or (more likely) serial. Generally, people aren't good at maintaining two versions of things, especially when they are only using one. I think this is an important question.

On the question about ice_shr_reprosum86.c. Yes, this can be distributed. It doesn't do anything unless the code is triggered by cpps set by the machine (or possibly the user). I inquired about removing it, but decided we should keep it. What is does is deal with integer size mismatches between fortran and c that exist in some older x86 hardware implementations and this is important to deal with if you are going to use the reprosum algorithms to get bit-for-bit correct results. A side benefit is that including a piece of c code highlighted some problems in our build that got fixed. I can remove it and then some bfbflag options might not work correctly on some architectures. I believe the community is unlikely to encounter those architectures. Having said that, if we don't like carry around the small c code, then that might be a reason to get rid of it. Otherwise, there is some benefit (both to the computation and general build capabilities) to keep it in the repo. But I'm open on this.

eclare108213 · 2019-03-27T03:27:19Z

I gravitate toward "simpler" when possible, and so my choices for both of these questions would be to simplify and document. The mpi and serial directories were hand-me-downs from POP, and I'm happy to get rid of them if the codes in them are sufficiently similar. I think a large reason for doing it that way in the first place was to reduce cpp usage, so if you have to add a cpp anyhow for reproducibility, and it can be used to merge the mpi and serial codes, then it makes sense to merge them. That sounds like a big project though, suitable for its own PR. In the meantime, does it make sense to just have one copy of ice_global_reductions.F90? It would need to be moved up in the directory structure.

Re the c code, I'd prefer not to carry it around, especially if our users are unlikely to be running on the particular architectures that wouldn't be BFB. I'd rather put a note in the documentation (there's a 'known issues' section somewhere), and perhaps include a pointer to where a user could get this code if he or she desperately needed it. I did not happen to see where it interfaced into CICE, but perhaps that part could be commented out with instructions for what to do if it were needed. I'm also open to keeping it, if there are genuine, ongoing benefits to having it, or if my suggestion to document and/or leave comments in the code doesn't make sense.

apcraig · 2019-03-27T03:42:23Z

Good comments.

I don't think we want to move to a single global_reductions file until we've thought that thru a bit. In particular, we'd have to add cpps to the build and we'd want to come up with a clean set of cpps to do that. We wouldn't want to keep the cpp name that is there. That name is particularly ugly for good reasons. I would prefer to keep the current files separate (even if they are nearly the same) until we decide to merge mpi and serial, at which point it should be a separate PR. I think it would be a mistake to have a mixed strategy of some cpps and some split directories. Right now we have split directories, the cpps that is in global_reductions is local, so it doesn't act like a model cpp. It is true that the main tradeoff for merging serial and mpi is adding cpps. But maybe it's just one, like NO_MPI that is triggered by the serial setting. That setting would trigger a cpp and not a separate file path as it doesn't now. I'm not arguing for merging or not merging right now, I think we need to think about it and get more input. In fact, one of the problems with serial and mpi is they are NOT the same and that worries me. Merging them would be some work with pros and cons. But I'm thinking there are more pros than cons at this point.

I'm OK pulling out the c code. What I would propose is that we create a new issue for it, and I do a bit more research on this implementation. This is basically Pat Worley's code and I had some emails with him a couple weeks ago about just this issue and we went back and forth about whether it should stay. I think he felt it didn't hurt anything and it was hard to argue against that. Let me create an issue at high priority to sort this out and then we can remove it if we decide that's OK.

How does that sound?

eclare108213

Sounds good.

apcraig · 2019-03-27T22:11:40Z

See issues #288 and #302.

* This is similar to the machine files cleanup in CICE-Consortium/CICE#300 * Remove MPICC, MPIFC since there is no MPI in Icepack

* Move makdep compilation to Makefile * makdep compiled from a makdep rule in the Makefile * Add CFLAG_HOST macro * makdep is only compiled if needed * This addresses CICE-Consortium/CICE#306 * Fix Travis-CI build * Update machine files to define SCC,SFC,CC,FC * This is similar to the machine files cleanup in CICE-Consortium/CICE#300 * Remove MPICC, MPIFC since there is no MPI in Icepack * Use $(DEPGEN) instead of makdep as target (Makefile) * Update documentation

apcraig added 5 commits March 15, 2019 00:46

implement bfb for mpi comm global sums, fix compile issue with cpps, …

2be61b1

…remove unused cpps

add set_nml.reprosum

b273f5c

update global reductions for serial mode to leverage bfbflag in the s…

d81f137

…ame way as mpi mode

update and fix bfb compare feature

77added

add comparelog.csh script and logbfb test

add34dc

apcraig self-assigned this Mar 18, 2019

apcraig requested review from eclare108213, mattdturner, dabail10 and duvivier March 18, 2019 02:36

duvivier approved these changes Mar 18, 2019

View reviewed changes

update travis gcc, fix tdir feature with tests

23dcb41

dabail10 approved these changes Mar 18, 2019

View reviewed changes

apcraig added 5 commits March 18, 2019 18:30

update Makefile for c compiles, update Macros files to support c comp…

b195dd9

…ile and serial/parallel compiles better and cleaner

modify report results script to address grep for bfbcomp cases, add f…

d66766d

…irst_suite.ts

fix Macros.onyx_intel, accidently removed debug flags in earlier commit

956fc49

Merge branch 'bfb' of https://github.com/apcraig/cice into bfb

236a3f1

update travis c build

a7f5013

apcraig mentioned this pull request Mar 19, 2019

Add ability for user to set test directory #281

Merged

fix script logic for more threads than tasks per node

67ed032

This was referenced Mar 20, 2019

cice.setup for tests : --bgen default does not work #301

Closed

poll_queue script / general csh usage #299

Closed

update scripts to address CICE-Consortium#299, poll_queue script and C…

feead25

…ICE-Consortium#301, single test bgen default use

apcraig requested a review from phil-blain March 20, 2019 18:24

phil-blain approved these changes Mar 21, 2019

View reviewed changes

add QSTAT variables to fram and cesium

b40f31f

eclare108213 reviewed Mar 27, 2019

View reviewed changes

eclare108213 approved these changes Mar 27, 2019

View reviewed changes

apcraig merged commit d417b97 into CICE-Consortium:master Mar 27, 2019

This was referenced Mar 27, 2019

bit-for-bit diagnostics #204

Closed

Fix hobart and cheyenne compile issues #303

Merged

phil-blain added a commit to phil-blain/Icepack that referenced this pull request Apr 24, 2019

Update machine files to define SCC,SFC,CC,FC

e6269c8

* This is similar to the machine files cleanup in CICE-Consortium/CICE#300 * Remove MPICC, MPIFC since there is no MPI in Icepack

apcraig deleted the bfb branch August 17, 2022 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor bit-for-bit diagnostics #300

Refactor bit-for-bit diagnostics #300

apcraig commented Mar 18, 2019 •

edited

Loading

duvivier commented Mar 18, 2019

apcraig commented Mar 18, 2019

duvivier commented Mar 18, 2019

apcraig commented Mar 18, 2019

dabail10 left a comment

apcraig commented Mar 18, 2019

eclare108213 commented Mar 18, 2019

apcraig commented Mar 18, 2019

phil-blain left a comment

apcraig commented Mar 21, 2019

apcraig commented Mar 21, 2019

apcraig commented Mar 26, 2019

eclare108213 commented Mar 26, 2019

eclare108213 left a comment

apcraig commented Mar 27, 2019

eclare108213 commented Mar 27, 2019

apcraig commented Mar 27, 2019

eclare108213 left a comment

apcraig commented Mar 27, 2019

Refactor bit-for-bit diagnostics #300

Refactor bit-for-bit diagnostics #300

Conversation

apcraig commented Mar 18, 2019 • edited Loading

duvivier commented Mar 18, 2019

apcraig commented Mar 18, 2019

duvivier commented Mar 18, 2019

apcraig commented Mar 18, 2019

dabail10 left a comment

Choose a reason for hiding this comment

apcraig commented Mar 18, 2019

eclare108213 commented Mar 18, 2019

apcraig commented Mar 18, 2019

phil-blain left a comment

Choose a reason for hiding this comment

apcraig commented Mar 21, 2019

apcraig commented Mar 21, 2019

apcraig commented Mar 26, 2019

eclare108213 commented Mar 26, 2019

eclare108213 left a comment

Choose a reason for hiding this comment

apcraig commented Mar 27, 2019

eclare108213 commented Mar 27, 2019

apcraig commented Mar 27, 2019

eclare108213 left a comment

Choose a reason for hiding this comment

apcraig commented Mar 27, 2019

apcraig commented Mar 18, 2019 •

edited

Loading