Thompson subcycling for develop, add missing hera.gnu debug modulefile #632

Merged

climbfuji
Collaborator

@climbfuji climbfuji commented Jun 9, 2021

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Description

Implement a subcycling capability for Thompson MP in CCPP and exercise it in the UFS regression tests. This ufs-weather-model PR only updates the submodule pointer for fv3atm for the changes w.r.t. Thompson MP described in the associated PRs below.

Additional changes:

  • add missing modulefile ufs_hera.gnu_debug
  • bugfix in compile.sh to copy the correct modulefile when DEBUG is used
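
The modulefile selection touched by this bugfix can be illustrated with a minimal sketch. It assumes the naming pattern implied by ufs_hera.gnu_debug, that is, a "_debug" suffix appended to the regular modulefile name. The function name and its arguments are hypothetical, and this is not the actual compile.sh logic, which is not shown in this PR text.

```python
def modulefile_name(machine_id: str, debug: bool) -> str:
    """Return the modulefile name for a machine/compiler pair.

    Assumption: debug builds use the regular name plus a "_debug" suffix,
    as suggested by the file name ufs_hera.gnu_debug mentioned in this PR.
    """
    base = f"ufs_{machine_id}"
    return f"{base}_debug" if debug else base


if __name__ == "__main__":
    # A debug build on hera.gnu should pick up the newly added modulefile.
    print(modulefile_name("hera.gnu", debug=True))
    print(modulefile_name("hera.gnu", debug=False))
```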

The changes to the suites FV3_GSD_noah/FV3_GFS_v16_thompson (run Thompson MP with 4/2 subcycles) change the answers of several Thompson MP based regression tests. No new input data is required.
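
As a rough illustration of why subcycling changes the answers, the sketch below splits one physics time step into several shorter substeps. It is a toy model, not the Thompson scheme or the CCPP implementation: run_with_subcycles, toy_decay, and the decay rate are all hypothetical names and values chosen for the example.

```python
def run_with_subcycles(step, state, dt, nsub):
    """Advance `state` over one time step `dt` using `nsub` substeps.

    `step(state, dt_sub)` is called `nsub` times with dt_sub = dt / nsub.
    """
    if nsub < 1:
        raise ValueError("nsub must be at least 1")
    dt_sub = dt / nsub
    for _ in range(nsub):
        state = step(state, dt_sub)
    return state


def toy_decay(q, dt, rate=1.0e-3):
    """Forward-Euler decay of a scalar q; a stand-in for a physics tendency."""
    return q - rate * q * dt


if __name__ == "__main__":
    q0, dt = 1.0, 600.0
    no_subcycling = run_with_subcycles(toy_decay, q0, dt, nsub=1)
    four_subcycles = run_with_subcycles(toy_decay, q0, dt, nsub=4)
    # The two results differ, so outputs are no longer bit-for-bit identical.
    print(no_subcycling, four_subcycles)
```

Because each substep sees an updated state, the result after the full time step differs from the single-step result even in this simple case, which is consistent with the baseline changes noted above.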

Issue(s) addressed

Fixes #596

Testing

Preliminary regression testing on Hera with GNU and Intel against existing baseline (2021/05/26): all tests that are expected to pass do pass, and all tests that are expected to fail do fail with b4b mismatches (but they all run to completion):

rt_hera_gnu_verify_against_existing.log
rt_hera_gnu_verify_against_existing_fail_test.log

rt_hera_initel_verify_against_existing.log
rt_hera_initel_verify_against_existing_fail_test.log
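
For readers unfamiliar with the term, a "b4b" (bit-for-bit) check can be sketched as a byte-level file comparison. This is only an illustration using Python's standard library; the function name bit_for_bit is hypothetical, and the comparison actually performed by the regression test scripts is not shown in this PR text.

```python
import filecmp


def bit_for_bit(result_path, baseline_path):
    """Return True if the two files have identical contents.

    shallow=False forces a comparison of the file contents rather than
    relying only on os.stat() signatures such as size and modification time.
    """
    return filecmp.cmp(result_path, baseline_path, shallow=False)
```

A test that is "expected to fail" in the sense used above would return False here for at least one output file while still running to completion.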

Full regression tests will be run on all tier-1 platforms when it is time to commit.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3

Dependencies

NCAR/ccpp-framework#379
NCAR/ccpp-physics#676
NOAA-EMC/fv3atm#328
#632

@climbfuji
Collaborator Author

Machine: cheyenne
Compiler: gnu
Job: BL
Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/666296517/20210609150010/ufs-weather-model
Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_37383
Test fv3_HAFS_v0_hwrf_thompson_debug 024 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson_debug 025 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 012 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 013 failed in run_test failed
Test fv3_rrfs_v1beta_debug 017 failed in run_test failed
Test regional_control_debug 015 failed in run_test failed
Test fv3_gsd_debug 018 failed in run_test failed
Test fv3_rrfs_v1alpha_debug 016 failed in run_test failed
Test control_thompson_no_aero_debug 020 failed in run_test failed
Test control_thompson 005 failed in run_test failed
Test control_thompson_debug 019 failed in run_test failed
Test control_thompson_no_aero 006 failed in run_test failed
Test fv3_gsd 009 failed in run_test failed
Test fv3_rrfs_v1alpha 010 failed in run_test failed
Test fv3_rrfs_v1beta 011 failed in run_test failed
Please make changes and add the following label back:
cheyenne-gnu-BL

@BrianCurtis-NOAA this is because you stopped them because of the GitHub token change, correct?

@BrianCurtis-NOAA
Collaborator

I haven't stopped anything yet, and I haven't switched the access token on Cheyenne either.

@BrianCurtis-NOAA
Collaborator

Machine: cheyenne
Compiler: gnu
Job: BL
Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/666296517/20210609154508/ufs-weather-model
Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_69021
Test fv3_HAFS_v0_hwrf_thompson_debug 024 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson_debug 025 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 012 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 013 failed in run_test failed
Test fv3_rrfs_v1beta_debug 017 failed in run_test failed
Test fv3_rrfs_v1alpha 010 failed in run_test failed
Test fv3_gsd_debug 018 failed in run_test failed
Test regional_control_debug 015 failed in run_test failed
Test fv3_rrfs_v1alpha_debug 016 failed in run_test failed
Test fv3_rrfs_v1beta 011 failed in run_test failed
Test fv3_gsd 009 failed in run_test failed
Test control_thompson_debug 019 failed in run_test failed
Test control_thompson_no_aero_debug 020 failed in run_test failed
Test control_thompson 005 failed in run_test failed
Test control_thompson_no_aero 006 failed in run_test failed
Please make changes and add the following label back:
cheyenne-gnu-BL
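
When comparing automated reports like the two above, it can help to reduce each one to a set of failed test names. The sketch below assumes the line format seen in these reports ("Test <name> <number> failed ..."); the function name failed_tests is hypothetical, and the sample strings are short excerpts, not the full reports.

```python
import re

# Matches both "Test <name> <nnn> failed failed" and
# "Test <name> <nnn> failed in run_test failed".
FAILED_LINE = re.compile(r"^Test (\S+) \d+ failed")


def failed_tests(report: str) -> set:
    """Collect the unique names of failed tests from a report body."""
    names = set()
    for line in report.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


# Excerpts from the two Cheyenne GNU reports, listed in different orders.
RUN_1 = """\
Test control_thompson 005 failed in run_test failed
Test fv3_gsd 009 failed in run_test failed
Test fv3_rrfs_v1beta 011 failed in run_test failed
"""
RUN_2 = """\
Test fv3_rrfs_v1beta 011 failed in run_test failed
Test fv3_gsd 009 failed in run_test failed
Test control_thompson 005 failed in run_test failed
"""

if __name__ == "__main__":
    first, second = failed_tests(RUN_1), failed_tests(RUN_2)
    # Tests that failed in only one of the two runs (empty if both agree).
    print(sorted(first ^ second))
```

Using sets makes the comparison independent of the order in which the tests finished, and it also collapses the paired "failed failed" / "failed in run_test failed" lines that some reports emit for the same test.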

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Jun 9, 2021

Not sure how helpful this is:
I've gone through 4 of those tests, and they all have a lot of

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

@BrianCurtis-NOAA
Collaborator

Machine: gaea
Compiler: intel
Job: BL
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/666296517/20210609214522/ufs-weather-model
Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_7787
Test fv3_gsd 035 failed failed
Test fv3_gsd 035 failed in run_test failed
Test fv3_rrfs_v1alpha 036 failed failed
Test fv3_rrfs_v1alpha 036 failed in run_test failed
Test fv3_hrrr 038 failed failed
Test fv3_hrrr 038 failed in run_test failed
Test fv3_rrfs_v1beta 039 failed failed
Test fv3_rrfs_v1beta 039 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 040 failed failed
Test fv3_HAFS_v0_hwrf_thompson 040 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 041 failed failed
Test fv3_esg_HAFS_v0_hwrf_thompson 041 failed in run_test failed
Test regional_quilt_hafs 032 failed failed
Test regional_quilt_hafs 032 failed in run_test failed
Test regional_control 029 failed failed
Test regional_control 029 failed in run_test failed
Test regional_quilt 031 failed failed
Test regional_quilt 031 failed in run_test failed
Test regional_quilt_RRTMGP 034 failed failed
Test regional_quilt_RRTMGP 034 failed in run_test failed
Test regional_quilt_netcdf_parallel 033 failed failed
Test regional_quilt_netcdf_parallel 033 failed in run_test failed
Test fv3_rap 037 failed failed
Test fv3_rap 037 failed in run_test failed
Test control_thompson 026 failed failed
Test control_thompson 026 failed in run_test failed
Test control_thompson_no_aero 027 failed failed
Test control_thompson_no_aero 027 failed in run_test failed
Please make changes and add the following label back:
gaea-intel-BL

@MinsukJi-NOAA
Contributor

The exact same tests failed again on Cray.

On Dell, these failed (all overlap with Cray failed jobs):
fv3_gsd, fv3_rrfs_v1alpha, fv3_rrfs_v1beta, fv3_hrrr, fv3_rap, regional_control, regional_quilt, regional_quilt_hafs, regional_quilt_netcdf_parallel, regional_quilt_RRTMGP

@BrianCurtis-NOAA
Collaborator

Machine: orion
Compiler: intel
Job: BL
Repo location: /work/noaa/nems/emc.nemspara/autort/pr/666296517/20210609164515/ufs-weather-model
Please manually delete: /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/rt_275485
Test control_thompson_no_aero 029 failed failed
Test control_thompson 028 failed failed
Test control_thompson_no_aero 029 failed in run_test failed
Test control_thompson 028 failed in run_test failed
Test fv3_gsd 037 failed failed
Test fv3_gsd 037 failed in run_test failed
Test fv3_rrfs_v1alpha 038 failed failed
Test fv3_rrfs_v1alpha 038 failed in run_test failed
Test regional_control 031 failed failed
Test regional_control 031 failed in run_test failed
Test regional_quilt 033 failed failed
Test regional_quilt 033 failed in run_test failed
Test regional_quilt_hafs 034 failed failed
Test regional_quilt_hafs 034 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 042 failed failed
Test fv3_HAFS_v0_hwrf_thompson 042 failed in run_test failed
Test fv3_rap 039 failed failed
Test fv3_hrrr 040 failed failed
Test fv3_rap 039 failed in run_test failed
Test fv3_hrrr 040 failed in run_test failed
Test fv3_rrfs_v1beta 041 failed failed
Test fv3_rrfs_v1beta 041 failed in run_test failed
Test regional_quilt_netcdf_parallel 035 failed failed
Test regional_quilt_netcdf_parallel 035 failed in run_test failed
Test regional_quilt_RRTMGP 036 failed failed
Test regional_quilt_RRTMGP 036 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 043 failed failed
Test fv3_esg_HAFS_v0_hwrf_thompson 043 failed in run_test failed
Please make changes and add the following label back:
orion-intel-BL

@BrianCurtis-NOAA
Collaborator

Machine: cheyenne
Compiler: intel
Job: BL
Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/666296517/20210609160011/ufs-weather-model
Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_25460
Test regional_quilt_hafs 032 failed in run_test failed
Test regional_quilt 031 failed in run_test failed
Test regional_control 029 failed in run_test failed
Test regional_quilt_netcdf_parallel 033 failed in run_test failed
Test regional_quilt_RRTMGP 034 failed in run_test failed
Test fv3_rrfs_v1alpha 036 failed in run_test failed
Test fv3_rap 037 failed in run_test failed
Test fv3_rrfs_v1beta 039 failed in run_test failed
Test fv3_hrrr 038 failed in run_test failed
Test fv3_gsd 035 failed in run_test failed
Test control_thompson_no_aero 027 failed in run_test failed
Test control_thompson 026 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 040 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 041 failed in run_test failed
Please make changes and add the following label back:
cheyenne-intel-BL

@BrianCurtis-NOAA
Collaborator

Machine: jet
Compiler: intel
Job: BL
Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/666296517/20210609220012/ufs-weather-model
Please manually delete: /lfs4/HFIP/h-nems/emc.nemspara/RT_RUNDIRS/emc.nemspara/FV3_RT/rt_125284
Test control_thompson 028 failed failed
Test control_thompson 028 failed in run_test failed
Test control_thompson_no_aero 029 failed failed
Test control_thompson_no_aero 029 failed in run_test failed
Test regional_quilt_netcdf_parallel 035 failed failed
Test regional_quilt_netcdf_parallel 035 failed in run_test failed
Test regional_quilt_hafs 034 failed failed
Test regional_quilt_hafs 034 failed in run_test failed
Test regional_control 031 failed failed
Test regional_control 031 failed in run_test failed
Test fv3_esg_HAFS_v0_hwrf_thompson 040 failed failed
Test fv3_esg_HAFS_v0_hwrf_thompson 040 failed in run_test failed
Test fv3_gsd 037 failed failed
Test fv3_gsd 037 failed in run_test failed
Test fv3_rrfs_v1alpha 038 failed failed
Test fv3_rrfs_v1alpha 038 failed in run_test failed
Test regional_quilt 033 failed failed
Test regional_quilt 033 failed in run_test failed
Test regional_quilt_RRTMGP 036 failed failed
Test regional_quilt_RRTMGP 036 failed in run_test failed
Test fv3_HAFS_v0_hwrf_thompson 039 failed failed
Test fv3_HAFS_v0_hwrf_thompson 039 failed in run_test failed
Please make changes and add the following label back:
jet-intel-BL

@climbfuji
Collaborator Author

I made a stupid mistake in the last-minute code changes based on the code review. It's fixed now, and regression tests have been kicked off manually on all systems. I also started the CI tests just now.

@junwang-noaa
Collaborator

junwang-noaa commented Jun 10, 2021 via email

@climbfuji
Collaborator Author

Dom, just to confirm, you are also running RT on wcoss, please let me know if you have any issue, I can help to run RT on wcoss.

Thanks a lot. Seems to be working. I created the baseline on cray, will start verification after copying it over. Dell is still busy creating baselines. The biggest bottleneck is hera. If all other platforms finish except hera, we may have to do the merge w/o waiting for hera to finish ...

@junwang-noaa
Collaborator

junwang-noaa commented Jun 10, 2021 via email

@climbfuji
Collaborator Author

CI tests passed for commit a3910a9.

@climbfuji
Collaborator Author

Regression tests passed on all machines. I pushed the gaea.intel log directly from the machine; I copied all the others to my laptop but haven't committed them yet. This way I can show users how to pull in the updates to simulate what they need to do when auto-rt pushes some of the log files.

Collaborator

@junwang-noaa junwang-noaa left a comment

Thanks for adding the hera_gnu debug modules.
Once all the RT log files are committed, the code can be committed.

@climbfuji climbfuji added the Ready for Commit Queue label Jun 10, 2021
@junwang-noaa junwang-noaa merged commit ab3f8f8 into ufs-community:develop Jun 10, 2021
epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
…632)

* Build UPP for AQM (Online-CMAQ).
* Add four new cycles to the workflow XML file for real-time run with varying forecast length hours.
* Update the UPP input namelist and control file.

---------

Co-authored-by: chan-hoo <chan-hoo.jeon@clogin04.cactus.wcoss2.ncep.noaa.gov>
Labels
  • Baseline Updates: Current baselines will be updated.
  • Ready for Commit Queue: The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
  • Waiting for Reviews: The PR is waiting for reviews from associated component PR's.
Development

Successfully merging this pull request may close these issues.

Implement subcycling capability for Thompson microphysics
5 participants