Description
While working on issue #335 in CABLE, the associated benchcab simulations showed numerical precision differences in all variables for the fluxsite experiments (see here).
After investigation, it turns out this is due to loading openmpi when doing serial compilation.
Tests performed
Running benchcab with the main and #335 branches returned differences in fluxsite outputs between realisations.
Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules. These outputs are identical to the outputs of main using benchcab.
It turns out that if we compile CABLE serially using the build.bash script, but with an openmpi module loaded (3 versions were tested), then the #335 branch gives slightly different results to the main branch. This happens even though the compilation does not use the openmpi module directly. It's probably a difference in some environment variable.
What do we want to do?
This is annoying as it may result in false negative results from benchcab.
Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?
Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?
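If we do decide to investigate, one possible starting point is to snapshot the environment before and after loading openmpi and compare the two. The sketch below is only an illustration: `parse_env` and `diff_env` are hypothetical helper names, not part of benchcab or CABLE, and it assumes the snapshots were written as NUL-separated dumps (e.g. with GNU `env -0`).

```python
def parse_env(dump):
    """Parse a NUL-separated environment dump into a {name: value} dict."""
    env = {}
    for entry in dump.split("\0"):
        if not entry:
            continue  # skip the empty piece after the trailing NUL
        name, _, value = entry.partition("=")
        env[name] = value
    return env


def diff_env(before, after):
    """Return (added, removed, changed) variables between two snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


def report(before, after):
    """Print one line per variable that differs between the snapshots."""
    added, removed, changed = diff_env(before, after)
    for name in sorted(added):
        print(f"+ {name}={added[name]}")
    for name in sorted(removed):
        print(f"- {name}={removed[name]}")
    for name in sorted(changed):
        old, new = changed[name]
        print(f"~ {name}: {old} -> {new}")
```

To use it, one could dump the environment to a file in a clean shell, load the openmpi module, dump it again, and pass the two parsed files to `report`. Variables that affect the compiler or linker search paths would be the first candidates to look at, though that is a guess at this stage.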
@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.