
GNU build on Hera is failing #962

Closed
GeorgeGayno-NOAA opened this issue Jun 13, 2024 · 24 comments · Fixed by #965

@GeorgeGayno-NOAA (Collaborator)

The head of develop (2794d41) no longer compiles on Hera with GNU. I get this error:

Lmod has detected the following error:  The following module(s) are unknown: "openmpi/4.1.5"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "openmpi/4.1.5"

Also make sure that all modulefiles written in TCL start with the string #%Module

Executing this command requires loading "openmpi/4.1.5" which failed while processing the following module(s):

    Module fullname      Module Filename
    ---------------      ---------------
    stack-openmpi/4.1.5  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
    build.hera.gnu       /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS.upstream/modulefiles/build.hera.gnu.lua
While processing the following module(s):
    Module fullname      Module Filename
    ---------------      ---------------
    stack-openmpi/4.1.5  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
    build.hera.gnu       /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS.upstream/modulefiles/build.hera.gnu.lua
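
As the Lmod message itself suggests, a few standard checks usually narrow an "unknown module" error down (the commands are stock Lmod; the cache path varies by Lmod version, so treat it as an assumption):

```shell
# Standard Lmod diagnostics, per the hints in the error message:
module spider openmpi                       # list every openmpi version Lmod can actually see
module --ignore_cache load "openmpi/4.1.5"  # retry while bypassing a possibly stale cache
rm -rf ~/.cache/lmod                        # clear the user spider cache (default location; may differ)
```

As the discussion below shows, the module had actually been moved, so in this case the cache was not the culprit.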
@GeorgeGayno-NOAA (Collaborator, Author)

@AlexanderRichert-NOAA - FYI

@AlexanderRichert-NOAA (Collaborator)

I'll look into this, but tagging @climbfuji who may have a more immediate answer on matters of OpenMPI on Hera.

All I can see in terms of system modules is openmpi/4.1.6_gnu9.2.0 ...

@GeorgeGayno-NOAA (Collaborator, Author)

> I'll look into this, but tagging @climbfuji who may have a more immediate answer on matters of OpenMPI on Hera.
>
> All I can see in terms of system modules is openmpi/4.1.6_gnu9.2.0 ...

When I tried openmpi 4.1.6, other libraries would no longer load. I went around in circles before giving up.

@GeorgeGayno-NOAA (Collaborator, Author)

I have v1.7 working. I will check in my branch so you can take a look.

@AlexanderRichert-NOAA (Collaborator)

Also tagging @RatkoVasic-NOAA in case he knows of recent changes -- it looks like the modification date on the openmpi module file is this past Tuesday, the 11th.

I just created an issue for this under spack-stack: JCSDA/spack-stack#1146

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 14, 2024
@RatkoVasic-NOAA (Contributor)

@GeorgeGayno-NOAA @AlexanderRichert-NOAA
Yes. openmpi/4.1.5 was built on CentOS and the new one (openmpi/4.1.6) was built on Rocky Linux. Since that transition, some applications have not worked correctly with the new GNU stack (e.g., a couple of coupled tests in ufs-weather-model, which are still turned off).
Natalie was (and is) working on installing the libraries with a newer version of GNU (13.x) plus openmpi/4.1.6 and has had some success, but the work is not yet finished.

@AlexanderRichert-NOAA (Collaborator)

I'm confused: why was it working a few days ago but isn't now? Did someone revert the configuration back to using 4.1.5?

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 14, 2024
@RatkoVasic-NOAA (Contributor)

We haven't used 4.1.5 for some time (/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles/openmpi/4.1.5). In SRW we now use the following (with spack-stack 1.6.0):

prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/openmpi/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.6.0_gnu13/envs/ufs-wm-srw-rocky8/install/modulefiles/Core")

load("stack-gcc/13.3.0")
load("stack-openmpi/4.1.6")
load("stack-python/3.10.13")
load("cmake/3.23.1")

load("srw_common")

load(pathJoin("nccmp", os.getenv("nccmp_ver") or "1.9.0.1"))
load(pathJoin("nco", os.getenv("nco_ver") or "5.1.6"))
load(pathJoin("openblas", os.getenv("openblas_ver") or "0.3.24"))

prepend_path("CPPFLAGS", " -I/apps/slurm_hera/23.11.3/include/slurm"," ")
prepend_path("LD_LIBRARY_PATH", "/apps/slurm_hera/23.11.3/lib")
setenv("LD_PRELOAD", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/13.3.0/lib64/libstdc++.so.6")

@AlexanderRichert-NOAA (Collaborator)

I don't follow. How is it that the modules/MODULEPATH settings in https://github.com/ufs-community/UFS_UTILS/blob/develop/modulefiles/build.hera.gnu.lua were working until a few days ago but aren't working now? Did something about the modulefiles change so that it's not pointing to the spack-stack-specific OpenMPI 4.1.5 installation?

@RatkoVasic-NOAA (Contributor)

I didn't know about UFS_UTILS... I was talking about the WM and SRW. I'll take a look at that modulefile.

@RatkoVasic-NOAA (Contributor)

@GeorgeGayno-NOAA try now

@RatkoVasic-NOAA (Contributor)

@AlexanderRichert-NOAA I manually added the line

prepend_path("MODULEPATH", "/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles")

in

/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua

@AlexanderRichert-NOAA (Collaborator)

Thanks. I can now load the stack-openmpi module, and for that matter build UFS_UTILS@develop without any modifications.

@GeorgeGayno-NOAA (Collaborator, Author)

UFS_UTILS now compiles, but the regression tests fail:

+ srun /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/../../exec/chgres_cube '1>&1' '2>&2'
[h22c32:2743796] mca_base_component_repository_open: unable to open mca_pmix_s1: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)

For more details, see this log file: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/consistency.log01.fail

@AlexanderRichert-NOAA (Collaborator)

Can you try /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8-ompi416/install/modulefiles/Core? Note the openmpi version change to 4.1.6. This stack uses the Hera admin-provided openmpi (I'm not sure why this wasn't used in the rocky8 rebuild for 1.6.0).
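
For reference, here is a minimal sketch of how a build modulefile could point at that environment (the MODULEPATH string is quoted from this comment; the module names and versions loaded below are assumptions for illustration, not the actual contents of build.hera.gnu.lua):

```lua
-- Sketch only: select the rocky8 environment built against the admin-provided OpenMPI 4.1.6.
prepend_path("MODULEPATH",
  "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8-ompi416/install/modulefiles/Core")

load("stack-gcc/9.2.0")      -- assumed from the gcc/9.2.0 paths earlier in this thread
load("stack-openmpi/4.1.6")  -- note the version change from 4.1.5 to 4.1.6
```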

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 18, 2024
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 18, 2024
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 18, 2024
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 18, 2024
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 21, 2024
@GeorgeGayno-NOAA (Collaborator, Author)

Using ad8c76f, I was able to compile with GNU on Hera. The unit tests passed. All regression tests except one ran to completion. Some passed; some differed from the baseline, although the differences were very small.

The first global_cycle regression test had a seg fault in the sfcsub.F routine.

 qc of snow
 snow set to zero over open sea at       363185  points (   61.575147840711807      percent)
 performing qc of snow     mode=           1 (0=count only, 1=replace)
 set snow temp to tsfsmx if greater
 performing qc of tsfc     mode=           1 (0=count only, 1=replace)
 performing qc of tsf2     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of zorc     mode=           1 (0=count only, 1=replace)

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
srun: error: h35m50: tasks 4-5: Segmentation fault (core dumped)
srun: Terminating StepId=62230736.0
slurmstepd: error: *** STEP 62230736.0 ON h35m50 CANCELLED AT 2024-06-21T20:44:38 ***
srun: error: h35m50: tasks 0-3: Terminated
srun: Force Terminated StepId=62230736.0
+ export ERR=143

Fixing this seg fault is beyond the scope of this issue. I will make a note and open another issue to address it.
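
One hedged aside, not from the thread: when chasing a GNU-build seg fault like the one in sfcsub.F, recompiling with gfortran's runtime checks often pinpoints the faulting line. The flags below are standard gfortran options; how they are threaded into the UFS_UTILS build is an assumption:

```shell
# Sketch: a debug rebuild with gfortran runtime checking enabled.
# -fcheck=bounds traps out-of-bounds array accesses; -fbacktrace prints the faulting line.
export FFLAGS="-g -O0 -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow"
./build_all.sh   # assumed build entry point; use the project's usual build script
```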

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jun 26, 2024
@GeorgeGayno-NOAA (Collaborator, Author)

The test above (using ad8c76f) was repeated with 05b6fc2. The results were the same.

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jul 3, 2024
@GeorgeGayno-NOAA (Collaborator, Author)

The test was repeated again using 4dca77a. The results were the same.

@GeorgeGayno-NOAA (Collaborator, Author)

I just tried compiling develop with GNU at 2794d41 (the hash that prompted this issue) and at 3ef2e6b. It works!

@RatkoVasic-NOAA - what is going on?

@AlexanderRichert-NOAA (Collaborator)

We did a rebuild of openmpi on Hera recently, as previously we were trying to use the copy built under CentOS. Does that possibly explain what you're referring to?

@GeorgeGayno-NOAA (Collaborator, Author)

> We did a rebuild of openmpi on Hera recently, as previously we were trying to use the copy built under CentOS. Does that possibly explain what you're referring to?

OK. That must explain why it is working again. In my branch (#965) I point to another stack. Should I revert to what was used before?

@AlexanderRichert-NOAA (Collaborator)

The only difference between the two environments is that unified-env-rocky8-ompi416 uses the sys admin-installed openmpi/4.1.6_gnu9.2.0 module, whereas unified-env-rocky8 uses a copy of openmpi 4.1.5 built by the spack-stack team (I'm not sure offhand why we have both). I would recommend unified-env-rocky8-ompi416, since it's not clear what network fabric support, if any, the other openmpi was built with, whereas the sys admin-installed openmpi 4.1.6 is definitely built with UCX support, which is almost certainly what you want.

@GeorgeGayno-NOAA (Collaborator, Author)

> The only difference between the two environments is that unified-env-rocky8-ompi416 uses the sys admin-installed openmpi/4.1.6_gnu9.2.0 module, whereas unified-env-rocky8 uses a copy of openmpi 4.1.5 built by the spack-stack team (I'm not sure offhand why we have both). I would recommend using unified-env-rocky8-ompi416 since it's not clear what if any network fabric support the other openmpi was built with (whereas the sys admin-installed openmpi 4.1.6 is definitely built with UCX support which is almost certainly what you want).

Thanks. Do you have any further comments on #965? I would like to merge it.

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jul 15, 2024