Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump base image to one with Ubuntu version 20.04 #364

Closed
wants to merge 3 commits into from

Conversation

spencerkclark
Copy link
Member

@spencerkclark spencerkclark commented Apr 13, 2023

This PR bumps the base image used in docker builds from one that uses Ubuntu 18.04 to one that uses 20.04, specifically the same base image as is used in the prognostic_run image of fv3net. This introduces new compilers (GNU version 9.4.0), which changes answers, requiring an update to the regression test checksums.

Upgrading the compilers also surfaced a new error in debug mode within the convection scheme when running with the TKE-EDMF turbulence parameterization active. This is currently relevant for the coarse-graining regression tests:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x14af0e9c0d21 in ???
#1 0x14af0e9bfef5 in ???
#2 0x14af0ab2408f in ???
#3 0x56514a6a8044 in samfdeepcnv_
    at physics/samfdeepcnv.f:1554
#4 0x56514a216adc in __module_physics_driver_MOD_gfs_physics_driver
    at GFS_layer/GFS_physics_driver.F90:3693
#5 0x565149ed78ec in __ipd_driver_MOD_ipd_step
    at /FV3/ipd/IPD_driver.F90:107
#6 0x565149211742 in __atmos_model_mod_MOD_update_atmos_physics._omp_fn.0
    at /FV3/atmos_model.F90:482
#7 0x14af0aa328e5 in ???
#8 0x5651491d8943 in __atmos_model_mod_MOD_update_atmos_physics
    at /FV3/atmos_model.F90:482
#9 0x5651491df8bb in __atmos_model_mod_MOD_update_atmos_radiation_physics
    at /FV3/atmos_model.F90:280
#10 0x565149212520 in coupler_main
    at /FV3/coupler_main.F90:192
#11 0x5651492155ec in main
    at /FV3/coupler_main.F90:35

This is a known error with a known fix (NOAA-EMC/fv3atm#33), and this fix does not change answers (ufs-community/ufs-weather-model#25). To demonstrate this, I perform this upgrade in three steps:

  1. In 96095d3 I have added non-debug-mode regression tests for coarse-graining to establish new checksums, and marked the debug-mode coarse-graining tests as expected failures.
  2. In 70ec1fd I make the fix to the convection scheme, and remove the expected failure marks from the debug mode versions. This second part does not require updating the non-debug-mode checksums (this is verified in CI).
  3. In 60844bf I remove the non-debug-mode coarse-graining regression tests.

Something else that changes when upgrading the GNU version 9 compilers is that the model compiled in REPRO mode (the default) produces different results than the model compiled in DEBUG mode. We previously relied on the assumption that REPRO mode and DEBUG mode tests produced identical results in the regression tests. Some changes to the regression tests were required to decouple from this assumption. These changes were made in 96095d3.

Finally, we also encounter a compile error when building the serialize image related to this line:

#40 110.3 driver/fvGFS/atmosphere.F90:2485:68:
#40 110.3 
#40 110.3  2469 |    do nb = 1,Atm_block%nblks
#40 110.3       |                            2                                        
#40 110.3 ......
#40 110.3  2485 |     call fs_read_field(ppser_serializer_ref, ppser_savepoint, 'nb', nb)
#40 110.3       |                                                                    1
#40 110.3 Error: Variable 'nb' at (1) set to undefined value inside loop  beginning at (2) as INTENT(OUT) argument to subroutine 'fs_read_field'
#40 110.3 driver/fvGFS/atmosphere.F90:2489:68:
#40 110.3 
#40 110.3  2469 |    do nb = 1,Atm_block%nblks
#40 110.3       |                            2                                        
#40 110.3 ......
#40 110.3  2489 |     call fs_read_field(ppser_serializer_ref, ppser_savepoint, 'nb', nb, ppser_zrperturb)
#40 110.3       |                                                                    1
#40 110.3 Error: Variable 'nb' at (1) set to undefined value inside loop  beginning at (2) as INTENT(OUT) argument to subroutine 'fs_read_field'

It is not immediately clear how to fix this, so we elect to remove the steps needed to build the serialize image from the docker file and any tests or makefile rules related to it.

@@ -235,9 +229,7 @@ FROM fv3gfs-build AS fv3gfs-compiled

RUN apt-get update && apt-get install -y \
python3 \
python3-pip && \
ln -s /bin/python3 /bin/python && \
ln -s /bin/pip3 /bin/pip
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without removing this line creating the symlink for pip3 fails since pip already exists in /bin:

ln: failed to create symbolic link '/bin/pip': File exists

I don't believe these symlinks are needed in general, so I removed the one above as well.

docker/Dockerfile Show resolved Hide resolved
docker/Dockerfile Show resolved Hide resolved
This bumps the base Ubuntu image used to one with a version of 20.04. The
pinned SHA makes things consistent with the base image we use when building the
prognostic run in fv3net.

Note that due to NOAA-EMC/fv3atm#33, the tests with the TKE-EDMF scheme active
fail in debug mode.  We currently mark them as expected failures.  To
demonstrate that answers do not change when we fix this bug, we run the
coarse-graining tests in repro mode here and checkpoint the checksums.  In a
followup commit, we will fix the bug, unmark the relevant debug mode tests, but
retain the repro mode tests to demonstrate that the tests still all pass
without modifying those checksums.  Finally in one last commit we will remove
those temporary repro mode tests as well as their checksums, as they will no
longer be needed.

Because the serialize_image fails to build with the new compilers, we also
elect to remove the infrastructure and tests associated with it.
It does allow us to add checksums for previously xfailed debug mode tests,
however.  See NOAA-EMC/fv3atm#33 for more details regarding the convection
scheme bug.
spencerkclark added a commit that referenced this pull request Sep 7, 2023
This PR refactors the build infrastructure in this repo to eliminate the need for the Docker component.  All development and testing is now done in the `nix` shell.  This should be a quality of life improvement for anyone developing the fortran model, as it no longer requires maintaining checksums in two separate build environments.

In so doing it introduces the following changes:
- New `make` rules are provided for compiling the model in different modes:
  - `build` -- build executables in `repro` (our production mode) and `debug` mode.
  - `build_repro` -- build only the `repro` mode executable.
  - `build_debug` -- build only the `debug` mode executable.
- Tests are run with each of the executables available in the local `bin` directory, and are tagged with the associated compile mode.  
- An option, `check_layout_invariance`, is provided to trigger regression tests be run with a 1x2 domain decomposition instead of a 1x1 domain decomposition to check invariance to the domain decomposition layout; this is used for the all the coarse-graining regression tests and replaces the previous `test_run_reproduces_across_layouts` test that would run in the docker image.
- `debug`-mode and `repro`-mode simulations produce different answers, which is something we noticed in #364 when upgrading compiler versions as well, and so require different reference checksums.

In working on this PR, we ran the fortran model in `debug` mode in more contexts than we had previously, some of which turned up errors, which we currently work around by using `pytest.skip` (something we had implicitly already been doing before):
- #365
- #381 

Working on this PR also brought my attention to the fact that `pytest`'s `tmpdir` fixture does not automatically get cleaned up after each test; `pytest` versions older than 7.3.0 keep around directories from the last three runs of `pytest`, which fill up disk space quickly since running these tests requires creating 10's of run directories, each with their own initial conditions and input files (#380).  For the time being I manually clean up these run directories after successful tests.

Resolves #340.
@spencerkclark
Copy link
Member Author

This PR is no longer relevant following the deprecation of the docker image in #379.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant