Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Enable UPP 2d decomposition #917

Merged

Conversation

MichaelLueken
Copy link
Collaborator

@MichaelLueken MichaelLueken commented Sep 21, 2023

DESCRIPTION OF CHANGES:

With recent changes made to the UFS-WM (PR #1823), it is now possible to see in the log files that UPP 2d decomposition is being performed for inline post. To this end, the UFS-WM hash has been updated to 020e783 (October 27, 2023), the UPP hash was updated to fae617b (October 6, 2023), and the UFS_UTILS hash was updated to dc0e4a6 (November 6, 2023).

With the updated UPP hash, the old postxconfig-NT-fv3lam.txt file had been removed. Wen suggested using the postxconfig-NT-fv3lam_rrfs.txt file instead. Unfortunately, due to issues with the CRTM on Derecho, we are unable to move forward with this postxconfig file (this file contains assimilated radiances). The ufs-weather-model/tests/parm directory contains a copy of the old postxconfig-NT-fv3lam.txt file. To expedite moving forward with this update, I'm currently pointing to this copy of the postxconfig file. Once the issue with the CRTM on Derecho has been corrected, the SRW App repository can transition to the postxconfig-NT-fv3lam_rrfs.txt file.

Changes to enable 2d decomposition include:

  • parm/model_configure - Added itasks to the model_configure file (values greater than 1 enable 2d decomposition in inline post).
  • scripts/exregional_run_post.sh - Added numx to the end of the &NAMPGB namelist options (values of numx greater than 1 enable 2d decomposition in offline post).
  • ush/create_model_configure_file.py - Added itasks to the list of variables to be added to the model_configure file.
  • tests/WE2E/test_configs/grids_extrn_mdls_suites_community/config.grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.yaml - Added ITASKS: 2 to enable inline post 2d decomposition.
  • tests/WE2E/test_configs/grids_extrn_mdls_suites_community/config.grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta.yaml - Added NUMX: 2 to enable offline post 2d decomposition.

The ability to run comprehensive tests has been added back into the Jenkinsfile, as well as the ability to run the automated Jenkins tests on Derecho.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • hercules.intel
  • orion.intel
  • derecho.intel
  • gaea.intel
  • gaea-c5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)
    The 2D decomposition does not work with GNU compilers (the run_post and run_fcst tasks fail on Hera GNU with double free or corruption errors). Disabling 2D decomposition for GNU compilers will allow the updated 2D decomposition WE2E tests to run on Hera GNU.

DEPENDENCIES:

None

DOCUMENTATION:

Documentation has been updated in the WE2E test configuration files, the ConfigWorkflow.rst chapter, and config_defaults.yaml.

ISSUE:

Resolves #480

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

MichaelLueken and others added 19 commits September 14, 2023 14:30
…CRTM coefficients and add CRTM_DIR to Hera machine yaml file.
…st. Add itasks: 2 to the model_configure file for the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test for inline post 2d decomposition and add numx: 2 to the itag namelist option file for the grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta test for offline post 2d decomposition.
…fficient directory and cleaned up comment in build_derecho_intel.lua.
… up-to-date versions. Use configxpost-NT-fv3lam.txt from ufs-weather-model/tests/parm instead of configxpost-NT-fv3lam_rrfs.txt until CRTM issue is addressed on Derecho. Add Derecho label to Jenkinsfile and reactivate comprehensive testing in Jenkins.
@MichaelLueken MichaelLueken force-pushed the feature/upp_2d_decomp branch 2 times, most recently from b8a6b57 to e720541 Compare November 8, 2023 15:07
MichaelLueken and others added 8 commits November 14, 2023 09:47
…Workflow.rst with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…Workflow.rst with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…unity/config.grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.yaml with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…unity/config.grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta.yaml with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…Workflow.rst with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…Workflow.rst with Gillian's suggestions.

Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
…ition documentation in ush/config_defaults.yaml and added prints regarding resetting NUMX and ITASKS in tests/WE2E/run_WE2E_tests.py.
@MichaelLueken
Copy link
Collaborator Author

@gspetro-NOAA - Thank you very much for reviewing these changes! I have committed the suggestions that you have made to the documentation, replaced the lowercase d with uppercase D for ITASKS in config_defaults.yaml for consistency, and added logging prints to run_WE2E_tests.py to let users know that the values are being reset for GNU compilers. I'm hoping I'll be able to satisfactorily go through this morning's code review! Please let me know if you see any other changes that should be made or if you have any questions.

… reset in run_WE2E_tests.py and replace lowercase d with uppercase D in grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta documentation.
@gspetro-NOAA
Copy link
Collaborator

gspetro-NOAA commented Nov 16, 2023

It looks like we need to add the line:

DOMAIN_PREGEN_BASEDIR: /contrib/EPIC/UFS_SRW_data/develop/FV3LAM_pregen

to the noaacloud machine file for the fundamental tests to run on the cloud. @EdwardSnyder-NOAA said that we did this for the release in PR #937 and should do that in develop, too. Without it, experiment generation failed. After I added that line, fundamental tests generally pass.

  • On AWS, only grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta fails, but it is failing due to hitting the walltime for the post-processing tasks. It passes on GCP.
  • On GCP, only grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 fails due to hitting the walltime for the run_fcst task. It passes on AWS.
    I can approve once the machine file change is made.

…ml to allow for workflow generation on NOAA Cloud platforms.
@MichaelLueken
Copy link
Collaborator Author

Thank you very much for running the fundamental tests on NOAA Cloud platforms, @gspetro-NOAA! I have added the DOMAIN_PREGEN_BASEDIR entry into ush/machine/noaacloud.yaml at
ec44d97.

Copy link
Collaborator

@gspetro-NOAA gspetro-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Approving...

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question before I approve

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I let this slide off my plate: I'll approve now so I don't hold up this PR any longer: my only concern was a bit of confusion on two different parameters controlling the same thing (one for inline post, one for offline post). But this is a documentation issue if it is an issue at all.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Feb 16, 2024
@MichaelLueken
Copy link
Collaborator Author

On Jet, the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 WE2E coverage test had failed. Rocotorewind/rocotoboot allowed the test to successfully pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240216161033                                           COMPLETE              20.37
custom_ESGgrid_20240216161035                                      COMPLETE              17.82
custom_ESGgrid_Great_Lakes_snow_8km_20240216161036                 COMPLETE              58.08
custom_GFDLgrid_20240216161038                                     COMPLETE               9.41
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202402  COMPLETE              10.73
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              55.55
get_from_HPSS_ics_RAP_lbcs_RAP_20240216161043                      COMPLETE              17.62
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240216161044  COMPLETE             338.41
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              45.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.29
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             523.51
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.59
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1116.31

This also shows that the code in this branch is ready to run on Rocky8, since the xjet partition is required to run the forecast job. One caveat is that the Jenkins pipeline for Jet will need to be set to one of the login nodes that has been transitioned to Rocky8, or else the run_fcst job will never run (you need to be logged into a Rocky8 login node in order to use the xjet partition).

Once a successful run of the nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 WE2E test completes on Derecho, this work can get merged. Currently encountering issues with the get_extrn_ics and get_extrn_lbcs tasks failing to retrieve the necessary data from AWS in 3 hours (this has only been encountered beginning after the Presidents' Day weekend). If this task continues to fail, then the memory for these tasks will need to be raised from 2G to a higher value.

@MichaelLueken
Copy link
Collaborator Author

Manually submitting the Jenkins coverage WE2E tests on Derecho successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km_20240221070627                      COMPLETE              30.98
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              49.38
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022107063  COMPLETE              54.23
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240221  COMPLETE              36.72
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              22.19
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022107064  COMPLETE              48.32
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              30.16
pregen_grid_orog_sfc_climo_20240221070649                          COMPLETE              23.06
specify_template_filenames_20240221070651                          COMPLETE              23.59
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             318.63

Moving forward with merging this work now.

@MichaelLueken MichaelLueken merged commit b3f1295 into ufs-community:develop Feb 21, 2024
3 of 5 checks passed
@MichaelLueken MichaelLueken deleted the feature/upp_2d_decomp branch February 21, 2024 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable UPP 2D decomposition in the SRW app
3 participants