
Incorrect gfs job dependencies caused gfsmetpg2g1 failure #2899

Closed
RussTreadon-NOAA opened this issue Sep 9, 2024 · 1 comment · Fixed by #2907
Labels: bug

@RussTreadon-NOAA
Contributor

What is wrong?

Jobs gfscleanup and gfsmetpg2g1 both depend upon the completion of gfsarch, so the two jobs can run concurrently. This is problematic because gfscleanup removes the directory in which the gfsmetpg2g1 job is running.

This behavior was observed on Hera but likely impacts all machines.

What should have happened?

The gfsmetp suite of jobs should run to completion before gfscleanup removes the run directory.

What machines are impacted?

All or N/A (observed on Hera; likely impacts all machines)

Steps to reproduce

  1. clone g-w develop
  2. set up g-w CI for GSI- or JEDI ATM-based DA
  3. cycle to the gfsmetp jobs (e.g., with rocotorun, as sketched below)
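
Cycling is typically driven by repeated rocotorun invocations against the experiment's workflow XML and database. A minimal sketch, with hypothetical paths and the prtest experiment name taken from the log below:

    # paths are hypothetical; substitute the experiment's actual files
    rocotorun -w "${EXPDIR}/prtest.xml" -d "${EXPDIR}/prtest.db"
    # inspect task states for the failing cycle (YYYYMMDDHHMM)
    rocotostat -w "${EXPDIR}/prtest.xml" -d "${EXPDIR}/prtest.db" -c 202402240000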

Additional information

A test of g-w CI C96C48_ufs_hybatmDA on Hera encountered the following scenario.

The 2024022400 gfsarch job completed, and rocotorun then submitted gfscleanup and gfsmetpg2g1. Both jobs have a single XML dependency: completion of gfsarch.

        <dependency>
                <and>
                        <taskdep task="gfsarch"/>
                </and>
        </dependency>
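
For illustration only (the actual change is in #2907): one way to serialize cleanup behind verification would be to give gfscleanup a second dependency on the metp jobs, e.g. assuming they are grouped under a Rocoto metatask named gfsmetp:

        <dependency>
                <and>
                        <taskdep task="gfsarch"/>
                        <metataskdep metatask="gfsmetp"/>
                </and>
        </dependency>

With metataskdep, cleanup would wait for every task expanded from the gfsmetp metatask, so the run directory survives until verification completes.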

Jobs gfsmetpg2g1 and gfscleanup started at the same time, Mon Sep 9 17:03:20 UTC 2024. gfscleanup finished at Mon Sep 9 17:03:48 UTC 2024. One of the last actions gfscleanup takes is to remove the top-level gfs run directory for the cycle:

+ exglobal_cleanup.sh[118]: rm -rf /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400
+ exglobal_cleanup.sh[120]: echo 'Cleanup /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400 completed!'

Unfortunately, gfsmetpg2g1 was running in /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400/metpg2g1.2502384. Removal of /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400 deleted the gfsmetpg2g1 run directory. Job gfsmetpg2g1 aborted at Mon Sep 9 17:03:52 UTC 2024 with the error messages

OSError: [Errno 116] Stale file handle: 'python_gen_env_vars.sh'
+ exgrid2grid_step1.sh[46]: status=1
+ exgrid2grid_step1.sh[47]: [[ 1 -ne 0 ]]
+ exgrid2grid_step1.sh[47]: exit 1
+ JGFS_ATMOS_VERIFICATION[1]: postamble JGFS_ATMOS_VERIFICATION 1725901403 1

Do you have a proposed solution?

No response

@RussTreadon-NOAA RussTreadon-NOAA added bug Something isn't working triage Issues that are triage labels Sep 9, 2024
@DavidHuber-NOAA DavidHuber-NOAA self-assigned this Sep 13, 2024
@DavidHuber-NOAA DavidHuber-NOAA removed the triage Issues that are triage label Sep 13, 2024
@DavidHuber-NOAA
Contributor

I will fix this as part of #2907.
