
restart reproducibility issue for global fv3 runs #272

Closed
junwang-noaa opened this issue Nov 12, 2020 · 16 comments · Fixed by #304
Labels
bug Something isn't working

Comments

@junwang-noaa
Collaborator

Description

The global fv3 runs (including the regression test cases) do not reproduce in restart cases when the model restarts from fh=24 hr (e.g. the control runs for 48 hr, the restart runs from 24 hr to 48 hr, and the results at fh=48 hr from control and restart are different). The code does reproduce when the restart starts within 24 hr and the results are compared at fh<=24 hr (e.g. control 24 hr, restart 12 hr->24 hr, results at fh=24 hr are identical), but results do not reproduce when compared at a forecast time beyond 24 hr (e.g. control runs 36 hr, restart runs from 12 hr to 36 hr, results from control and restart are identical at fh=24 hr but not at fh=36 hr).

To Reproduce:

The issue can be reproduced on all the supported platforms, including Hera, Orion, and WCOSS.

  1. Check out the code and run the fv3 control regression test for 48 hr with the restart interval set to 24 in model_configure.
  2. Copy the run directory and remove all the output files (dynf*, phyf*, logf*). Copy the 24 hr restart files from step 1 into the input directory of the copied run directory and rename them so the date is removed from the file names.
  3. Change the following namelist variables:
    warm_start=.true.
    nggps_ic=.false.
    external_ic=.false.
    mountain=.true.
    make_nh=.false.
    na_init=0
    Note: if the cold-start run uses NSST spin up (nstf_name(2)=1), nstf_name(2) must be turned off for the restart run (nstf_name(2)=0), e.g. nstf_name=2,0,1,0,5.
  4. Submit the job, then compare the output files with those created in step 1. A sketch of the namelist changes is given below.
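
For reference, here is a minimal sketch of what the restart-run changes in step 3 could look like. The namelist group names (&fv_core_nml for the dycore flags, &gfs_physics_nml for nstf_name) and the model_configure entry reflect the usual FV3/UFS layout and should be verified against the files in the actual run directory:

    ! input.nml of the restart run -- only the changed entries are shown
    &fv_core_nml
      warm_start  = .true.    ! start from the warm-start restart files
      nggps_ic    = .false.
      external_ic = .false.
      mountain    = .true.
      make_nh     = .false.
      na_init     = 0         ! no adiabatic initialization on restart
    /
    &gfs_physics_nml
      nstf_name   = 2,0,1,0,5 ! second entry 0: NSST spin up off for the restart
    /

    # model_configure of both runs: write restart files every 24 hours
    restart_interval:        24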
@junwang-noaa added the bug label Nov 12, 2020
@climbfuji
Collaborator

This must be something recent! The FV3_GSD_v0 suite passes exactly this test with the code in https://github.com/NOAA-GSL/ufs-weather-model (default branch is gsd/develop); it is run every time we make a commit (see tests/rt_ccpp_gsd.conf). The last commit we merged from ufs-community/ufs-weather-model into this branch is from October 1:

commit 208f36dfa7e13be18967c60cca01a64ca02de4c7
Author: Dom Heinzeller <dom.heinzeller@icloud.com>
Date:   Thu Oct 1 06:16:09 2020 -0600

    CCPP tendencies bugfixes, global restart reproducibility, halo boundary update in dycore (#208)

You should be able to go back to this hash and get b4b reproducible results in develop. Unless it is something specific to the suite you are using that doesn't wreak havoc for the FV3_GSD_v0 suite.

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 12, 2020 via email

@junwang-noaa
Collaborator Author

junwang-noaa commented Nov 12, 2020 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 12, 2020 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 12, 2020 via email

@junwang-noaa
Collaborator Author

junwang-noaa commented Nov 12, 2020 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 12, 2020 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 12, 2020 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Nov 14, 2020 via email

@junwang-noaa
Collaborator Author

Moorthi, thank you! That is good news! I don't have specific suggestions, but it looks to me that we may need to avoid using the PROD compiler options on certain files. @climbfuji, do you have any suggestions? If I remember correctly, you did similar work when making CCPP reproduce IPD results before.

@climbfuji
Collaborator

Moorthi, thank you! That is good news! I don't have specific suggestions, but it looks to me that we may need to avoid using the PROD compiler options on certain files. @climbfuji, do you have any suggestions? If I remember correctly, you did similar work when making CCPP reproduce IPD results before.

Thanks for all the detective work. I agree, we have to identify which file or routine is causing the difference, and then which of the three PROD optimizations (-xCORE-AVX2, -no-prec-div, -no-prec-sqrt) is responsible. I would do this as follows, using the CCPP debugging routines in GFS_debug.F90.

  • put calls to GFS_diagtoscreen and GFS_interstitialtoscreen into the suite definition file, right after the interstitial_rad_reset
  • modify the two _run routines for the two schemes to only produce output for the kdt value that corresponds to the first timestep after the warmstart
  • do the full run (be sure to have --label in the srun call in the job submission script, so that the MPI rank gets prepended - this allows you to split the stdout/stderr file later by task)
  • do the coldstart/warmstart run
  • split stdout and stderr by MPI rank (a possible one-liner for this is sketched after this list), then use a graphical diff tool (e.g. meld) to compare the directories and all the split files
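
A minimal sketch of the per-rank split, assuming the job was run with srun --label so every line of the log starts with the rank followed by ": " (file names here are illustrative):

    # split a --label-prefixed log into one file per MPI rank
    awk -F': ' 'NF > 1 { print > ("stdout.rank_" $1) }' job.out

Running the same command on the logs of the control run and the coldstart/warmstart run gives two sets of per-rank files that can then be compared with meld.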

If there are differences in the output from the two diag routines, then we need to look at the time_vary group (or the dycore if GFDL-MP is used, because of the saturation adjustment). Maybe turn off the saturation adjustment for the next set of runs to see if the differences go away. If they do, it's the fv_sat_adj calls; if they don't, it's in the time_vary group.

If there are no differences at all in the output from the two diag routines, then it's in the radiation, physics, or stochastics group. In this case, add the same diagtoscreen routines immediately after the GFS_stateout_reset call in the SDF; this tells you whether it is the radiation group or not. If it is not the radiation group, add the calls at the beginning of the GFS_stochastics group, which tells you whether it is in the physics or the stochastics group.

Once we know the group, use an iterative approach (bisect the group / block in the group) until the scheme is identified.
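
To make the SDF placement concrete, here is a minimal sketch of how the two debugging schemes could be inserted into a suite definition file. The suite name and the exact name of the rad-reset interstitial are placeholders based on typical FV3 CCPP suites and should be checked against the SDF actually in use:

    <!-- excerpt of a suite definition file; only the relevant schemes are shown -->
    <suite name="FV3_GFS_v16" version="1">
      <group name="radiation">
        <subcycle loop="1">
          <scheme>GFS_suite_interstitial_rad_reset</scheme>
          <!-- debugging output inserted right after the rad reset -->
          <scheme>GFS_diagtoscreen</scheme>
          <scheme>GFS_interstitialtoscreen</scheme>
          <!-- ... remaining radiation schemes ... -->
        </subcycle>
      </group>
    </suite>

The same two scheme entries can later be moved after the GFS_stateout_reset call or to the start of the GFS_stochastics group to narrow the search down, as described above.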

Moorthi, I can do all this for you if you like. What I would need is

  • a complete, self-contained run directory with a job submission script where the only thing I have to do is link the executable and possibly a modulefile
  • a complete source code directory with instructions on how you compile the code
  • on a machine that is not WCOSS, unfortunately

Hope this helps.

@junwang-noaa
Collaborator Author

I believe we still need to test restart reproducibility on the standalone global FV3 after PR #304. I haven't seen results from a 24->48 hr restart test yet.

@junwang-noaa reopened this Dec 2, 2020
@climbfuji
Collaborator

@SMoorthi-emc @junwang-noaa has this issue been fixed with today's merge of #304?

@junwang-noaa
Collaborator Author

I am working on the regional inline post and haven't had time to run the tests yet.

@SMoorthi-emc
Contributor

I don't know. In my own tests, I have to run in REPRO mode with CCPP.

@climbfuji
Collaborator

Closed via #325.

pjpegion pushed a commit to NOAA-PSL/ufs-weather-model.p7b that referenced this issue Jul 20, 2021
* fix lam post uninitialized fields
* remove spval in openmp
* add more uninitialized post fields
* update suite_FV3_GFS_v15_thompson_mynn_lam3km.xml to use mynnsfc_wrapper instead of sfc_diff