C96C48_ufs_hybatmDA breaks with 3b208124 #2524

Closed
RussTreadon-NOAA opened this issue Apr 23, 2024 · 12 comments
Labels: bug, triage

Comments

@RussTreadon-NOAA
Contributor

What is wrong?

gdasfcst and enkfgdasfcst_mem* fail in the half cycle which starts C96C48_ufs_hybatmDA CI. The 2024022318 gdasfcst fails with

 0: INPUT/coupler.res: date_init=2024   2  23  12   0   0
 0: INPUT/coupler.res: date     =2024   2  23  18   0   0
 0: fcst_initialize ERROR: date_init /= date_init_res
 0:                        date_init     = 2024   2  23  18   0   0
 0:                        date_init_res = 2024   2  23  12   0   0
 0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
srun: error: h11c29: tasks 0-39: Exited with exit code 1

What should have happened?

The half cycle gdas and enkfgdas forecasts should run to completion.

What machines are impacted?

All or N/A

Steps to reproduce

  1. Clone and install g-w develop at 3b20812.
  2. Enable C96C48_ufs_hybatmDA.
  3. Set up C96C48_ufs_hybatmDA.
  4. Run C96C48_ufs_hybatmDA.

Additional information

This failure was discovered while attempting to run C96C48_ufs_hybatmDA using DavidNew-NOAA:feature/jediinc2fv3 at 015aec2 on Hera. Earlier in the morning DavidNew-NOAA:feature/jediinc2fv3 at 94d28d5 was successfully run on Orion.

The difference between these two snapshots of feature/jediinc2fv3 is that the Orion snapshot is before inclusion of 3b20812. The Hera snapshot is after 3b20812 was merged into feature/jediinc2fv3.

The C96C48_ufs_hybatmDA CI test has EXP_WARM_START=".true.". I see that C96C48_hybatmDA runs GSI-based DA from 2021122018 with EXP_WARM_START=".false.". Is this the problem? Do we need to update the C96C48_hybatmDA case to cold start initial conditions?

Do you have a proposed solution?

No response

RussTreadon-NOAA added the bug and triage labels Apr 23, 2024
@aerorahul
Contributor

@RussTreadon-NOAA
I don't think we should need to update C96C48_hybatmDA to use cold start initial conditions.
We should be able to start an experiment with restarts.
I also don't think we need to have a variable EXP_WARM_START, and rather infer the starting conditions based on the availability of the type of data.
I can prioritize this issue since 3b20812 effectively broke this capability.
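
For illustration only, here is a minimal sketch of what inferring the start type from the available data could look like. The function name `infer_start_type` is hypothetical. The restart file naming follows the paths quoted in this thread; the cold-start layout (`input/gfs_data.tile*.nc`) is an assumption and may not match the actual initial-condition staging.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of global-workflow.
# Guess the start type from the files present under an IC directory.
# ASSUMPTION: cold-start inputs are stored as <ic_dir>/input/gfs_data.tile*.nc.
infer_start_type() {
  local ic_dir="$1"
  local stamp="$2"   # restart time stamp, e.g. 20240223.180000
  local f
  # A coupler.res restart file for the requested time implies a warm start.
  if [ -f "${ic_dir}/restart/${stamp}.coupler.res" ]; then
    echo "warm"
    return 0
  fi
  # Otherwise look for any cold-start atmosphere input tile.
  for f in "${ic_dir}"/input/gfs_data.tile*.nc; do
    if [ -e "${f}" ]; then
      echo "cold"
      return 0
    fi
  done
  echo "unknown"
  return 1
}
```

A caller could then branch on the printed value instead of reading EXP_WARM_START, and treat "unknown" as a setup error.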

@aerorahul
Contributor

@RussTreadon-NOAA
If you could try this patch, I think you will be able to run with warm start again.

❯❯❯ git diff
diff --git i/parm/config/gfs/config.base w/parm/config/gfs/config.base
index 5a58c752..24fae5d0 100644
--- i/parm/config/gfs/config.base
+++ w/parm/config/gfs/config.base
@@ -302,6 +302,7 @@ export WRITE_NSFLIP=".true."

 # IAU related parameters
 export DOIAU="@DOIAU@"        # Enable 4DIAU for control with 3 increments
+if [[ "${MODE}" == "cycled" && "${SDATE}" == "${PDY}${cyc}" && ${EXP_WARM_START} == ".true." ]]; then export DOIAU="NO"; fi
 export IAUFHRS="3,6,9"
 export IAU_FHROT=${IAUFHRS%%,*}
 export IAU_DELTHRS=6

Please let me know if this works, and we can at least commit this while we work on a robust solution.

@RussTreadon-NOAA
Contributor Author

Thank you @aerorahul for your suggestion. I added the patch line to config.base in /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prtest. The 2024022318 gdasfcst was rewound and booted. The forecast job failed in the same manner as before.

This makes sense. ci/cases/yamls/ufs_hybatmDA_defaults.ci.yaml includes

base:
  DOIAU: "NO"

so that the config.base in prtest has DOIAU=NO.

It's my understanding that at present we do not have an IAU functionality in JEDI ATM DA. Might there be something in 3b20812 that assumes DOIAU=YES? Alternatively, there's something wrong with the ICS I created for the C96C48_ufs_hybatmDA case.

When things break, I usually start with me as the source of the problem.

@RussTreadon-NOAA
Contributor Author

For what it's worth, I reproduced the gdasfcst failure in C96C48_hybatmDA (GSI-based DA) by making the following changes in ci/cases/yamls/gfs_defaults_ci.yaml

@@ -2,3 +2,11 @@ defaults:
   !INC {{ HOMEgfs }}/parm/config/gfs/yaml/defaults.yaml
 base:
   ACCOUNT: {{ 'SLURM_ACCOUNT' | getenv }}
+  DOIAU: "NO"
+esfc:
+  DONST: "NO"
+nsst:
+  NST_MODEL: "1"
+sfcanl:
+  DONST: "NO"
+

The half cycle 2021122018 cold start gdasfcst did not fail. The warm start 2021122100 gdasfcst failed with

 0: CurrTime = 2021   12   21    0    0    0
 0: StopTime = 2021   12   21    9    0    0
 0: INPUT/coupler.res: date_init=2021  12  20  18   0   0
 0: INPUT/coupler.res: date     =2021  12  21   0   0   0
 0: fcst_initialize ERROR: date_init /= date_init_res
 0:                        date_init     = 2021  12  21   0   0   0
 0:                        date_init_res = 2021  12  20  18   0   0
 0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
srun: error: h32m38: tasks 0-39: Exited with exit code 1

See /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest_gsi/logs/2021122100/gdasfcst.log for details.

@aerorahul
Contributor

@RussTreadon-NOAA
I suspect the issue is the mismatch between the dates in INPUT/coupler.res and model_configure.
I am trying to replicate the ufsda test on Hera. Can you point me to your setup_ci.sh script on Hera with the initial conditions for the ufsda test?

@RussTreadon-NOAA
Contributor Author

I use /scratch1/NCEPDEV/stmp2/Russ.Treadon/setup_ci.sh. It's currently configured to set up C96C48_hybatmDA. You'll see commented out lines in the script to set up C96C48_ufs_hybatmDA.

Don't work into the early hours of Wednesday on this!

@aerorahul
Contributor

aerorahul commented Apr 24, 2024

As I suspected, the issue is in the date in coupler.res.
The experiment is set up to do a warm start from 2021 12 20 18, i.e. the restarts are valid at that date.
However, the restarts (and more importantly the coupler.res file) were obtained from some other previous run. That run may not have had the same cycling cadence (6-hourly), or it may have been a long integration (say, for spin-up).
coupler.res has two time lines: the model start time and the current model time. The current model time is the one above (2021 12 20 18).

I did 2 experiments:

  1. I made the model start time == model current time in the coupler.res file. This time is consistent with the start time in model_configure. The model checks these 3 times against one another, together with FHROT for IAU initialization. If things check out, great. If not, you get the error reported in this issue.
  2. I removed the coupler.res file. In this case the model relies solely on the start time from model_configure, and it ran without errors.

So, I think we need a consistent coupler.res file to start off the model when starting an experiment with a warm start, regardless of IAU=ON|OFF. We can do this in the workflow by adding an if (EXP_WARM_START == true) block as a safeguard, but I would prefer that the provider of the initial conditions supply a properly created coupler.res file.
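
As a rough sketch of such a safeguard, the check below compares the two date lines of a coupler.res file. The function name `coupler_res_is_consistent` is hypothetical, and the check covers only the "start time == current time" condition described above, not the FHROT/IAU logic inside the model.

```shell
#!/usr/bin/env bash
# Hypothetical check, not part of global-workflow.
# Returns 0 when line 2 (model start time) and line 3 (current model time)
# of a coupler.res file carry the same six date fields.
coupler_res_is_consistent() {
  local file="$1"
  local start current
  start="$(awk 'NR == 2 { print $1, $2, $3, $4, $5, $6 }' "${file}")"
  current="$(awk 'NR == 3 { print $1, $2, $3, $4, $5, $6 }' "${file}")"
  # An empty result means the file is missing or too short: treat as inconsistent.
  [ -n "${start}" ] && [ "${start}" = "${current}" ]
}
```

In a warm-start block this might be used as `coupler_res_is_consistent "${file}" || echo "WARNING: inconsistent coupler.res"`, with the choice between warning, aborting, and repairing left to the workflow.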

In short, I am suggesting we update the gdas.20240223/12/model_data/atmos/restart/20240223.180000.coupler.res file from:

     3        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2024     2    23    12     0     0        Model start time:   year, month, day, hour, minute, second
  2024     2    23    18     0     0        Current model time: year, month, day, hour, minute, second

to

     3        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2024     2    23    18     0     0        Model start time:   year, month, day, hour, minute, second
  2024     2    23    18     0     0        Current model time: year, month, day, hour, minute, second

@RussTreadon-NOAA
Contributor Author

Thank you @aerorahul for identifying the problem. Let me modify coupler.res as you indicate and rerun C96C48_ufs_hybatmDA.

@RussTreadon-NOAA
Contributor Author

@aerorahul, I changed 20240223.180000.coupler.res as you suggested:

era(hfe05):/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest/gdas.20240223/12/model_data/atmos/restart$ cat 20240223.180000.coupler.res
     3        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2024     2    23    18     0     0        Model start time:   year, month, day, hour, minute, second
  2024     2    23    18     0     0        Current model time: year, month, day, hour, minute, second

After this the prtest 2024022318 gdasfcst was rewound and rebooted. ufs_model.x ran to completion.

Unfortunately, forecast_postdet.sh failed when attempting to copy 20240224.030000.coupler.res. This file does not exist.

Log file /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest/logs/2024022318/gdasfcst.log contains the following

+ forecast_postdet.sh[231]: ((  restart_date < forecast_end_cycle  ))
+ forecast_postdet.sh[245]: echo 'Copying FV3 restarts for '\''RUN=gdas'\'' at the end of the forecast segment: 2024022403'
Copying FV3 restarts for 'RUN=gdas' at the end of the forecast segment: 2024022403
+ forecast_postdet.sh[246]: for fv3_restart_file in "${fv3_restart_files[@]}"
+ forecast_postdet.sh[247]: restart_file=20240224.030000.coupler.res
+ forecast_postdet.sh[248]: /bin/cp -p /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gdasfcst.2024022318/restart/FV3_RESTART/20240224.030000.coupler.res /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest/gdas.20240223/18//model_data/atmos/restart/20240224.030000.coupler.res
/bin/cp: cannot stat '/scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gdasfcst.2024022318/restart/FV3_RESTART/20240224.030000.coupler.res': No such file or directory
+ forecast_postdet.sh[1]: postamble exglobal_forecast.sh 1713952059 1
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 09:55:40 with error code 1 (time elapsed: 00:08:01)

Directory /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gdasfcst.2024022318/restart/FV3_RESTART only contains 20240224.000000.coupler.res.

Do we encounter this error due to the fact that the prtest runs with DOIAU=NO?

@RussTreadon-NOAA
Contributor Author

As a test, I made the following change to a working copy of ush/forecast_postdet.sh.

@@ -242,12 +242,14 @@ FV3_out() {
   # Copy the final restart files at the end of the forecast segment
   # The final restart written at the end of the forecast does not include the valid date
   # TODO: verify the above statement since RM found that it did!
-  echo "Copying FV3 restarts for 'RUN=${RUN}' at the end of the forecast segment: ${forecast_end_cycle}"
-  for fv3_restart_file in "${fv3_restart_files[@]}"; do
-    restart_file="${forecast_end_cycle:0:8}.${forecast_end_cycle:8:2}0000.${fv3_restart_file}"
-    ${NCP} "${DATArestart}/FV3_RESTART/${restart_file}" \
-           "${COM_ATMOS_RESTART}/${restart_file}"
-  done
+  if [[ "${DOIAU:-}" == "YES" ]]; then
+      echo "Copying FV3 restarts for 'RUN=${RUN}' at the end of the forecast segment: ${forecast_end_cycle}"
+      for fv3_restart_file in "${fv3_restart_files[@]}"; do
+         restart_file="${forecast_end_cycle:0:8}.${forecast_end_cycle:8:2}0000.${fv3_restart_file}"
+         ${NCP} "${DATArestart}/FV3_RESTART/${restart_file}" \
+                "${COM_ATMOS_RESTART}/${restart_file}"
+      done
+  fi

The change wraps the scripting that copies FV3 restarts at the end of the forecast segment in a DOIAU=YES block.

The prtest runs with DOIAU=NO. With the above change to forecast_postdet.sh, the prtest 2024022318 gdasfcst and enkfgdasfcst ran to completion.

@aerorahul
Contributor

In speaking offline with model developers, I learned that restart_interval should be an explicit list. In the past, a value of -1 would create the last restart, but that behavior was removed at users' request.
I am working on a bugfix that will create the explicit list for restart_interval in the model_configure file. That will eliminate this issue. Please stay tuned!
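
To illustrate the idea, here is a minimal sketch of building an explicit list of restart hours that always ends at the final forecast hour. The function name `restart_interval_list` and its two arguments are hypothetical, and the space-separated output format is an assumption about what model_configure expects. The actual bugfix may be implemented differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the actual bugfix.
# Print restart hours every `step` hours up to and including `fhmax`.
restart_interval_list() {
  local step="$1"
  local fhmax="$2"
  local hour list
  # A non-positive step would loop forever, so reject it.
  [ "${step}" -gt 0 ] || return 1
  hour="${step}"
  list=""
  while [ "${hour}" -lt "${fhmax}" ]; do
    list="${list}${hour} "
    hour=$((hour + step))
  done
  # Always append the final forecast hour so the last restart is requested.
  echo "${list}${fhmax}"
}
```

For the 9-hour gdas forecast in this issue, a 6-hour step would yield a list containing both hour 6 and hour 9, so a restart at 2024022403 would be requested explicitly rather than relying on -1.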

@RussTreadon-NOAA
Contributor Author

C96C48_ufs_hybatmDA successfully ran to completion on Hera and Orion using 2ecf4f8.

Close this issue.
