Description
Brief summary of bug
MPI tests with DEBUG on fail at runtime with the nvhpc compiler on cheyenne.
The failure persists in ctsm5.1.dev155-38-g5c8f17b1a (derecho1 branch) on derecho.
General bug information
CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: tests with nvhpc and DEBUG on
Details of bug
These tests fail:
SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
The corresponding tests with DEBUG off pass:
SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
The mpi-serial tests also pass, with DEBUG on or off:
SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
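The pattern in the lists above (failures only when DEBUG is on and a real MPI library is used) can be checked mechanically. The following is a minimal sketch, assuming the usual CIME test-name layout TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS], where the D modifier means DEBUG and Mmpi-serial selects the serial MPI stub. The function names parse_test_name and is_debug_with_real_mpi are hypothetical helpers, not part of CIME or CTSM.

```python
# Test names copied from the lists in this report.
FAILING = [
    "SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart",
    "SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef",
    "SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default",
    "SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default",
]
PASSING = [
    "SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart",
    "SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default",
    "SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default",
    "SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default",
    "SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef",
    "SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default",
    "SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default",
]


def parse_test_name(name):
    """Split a test name into its dot-separated fields (assumed layout)."""
    case, grid, compset, machine_compiler, *testmods = name.split(".")
    test_type, *modifiers = case.split("_")
    machine, compiler = machine_compiler.rsplit("_", 1)
    return {
        "test_type": test_type,
        "modifiers": modifiers,
        "debug": "D" in modifiers,
        "mpi_serial": "Mmpi-serial" in modifiers,
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
        "testmods": testmods[0] if testmods else None,
    }


def is_debug_with_real_mpi(info):
    """True for the configuration this report identifies as failing."""
    return info["debug"] and not info["mpi_serial"]


if __name__ == "__main__":
    for label, names in (("FAIL", FAILING), ("PASS", PASSING)):
        for name in names:
            info = parse_test_name(name)
            print(label, info["debug"], info["mpi_serial"], name)
```

Every test in the failing list has DEBUG on without mpi-serial, and no test in the passing lists does, which is consistent with the "Configurations affected" entry.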
Important details of your setup / configuration so we can reproduce the bug
Important output or errors that show the problem
For the smallest case, SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default, the only log file available is cesm.log.
cesm.log file:
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
[r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35003:0:35003] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35004:0:35004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35006:0:35006] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35007:0:35007] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35008:0:35008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35010:0:35010] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35011:0:35011] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35012:0:35012] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35013:0:35013] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35014:0:35014] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35015:0:35015] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35017:0:35017] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35018:0:35018] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35019:0:35019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35020:0:35020] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35022:0:35022] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35000:0:35000] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35001:0:35001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35016:0:35016] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==== backtrace (tid: 35022) ====
0 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2ba9d97301a4]
1 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4cc) [0x2ba9d97304cc]
2 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a73b) [0x2ba9d973073b]
3 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6LogErr13MsgFoundErrorEiPKciS2_S2_Pi+0x34) [0x2ba9b78f4c74]
4 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap22meshcreatenodedistgridEPi+0x7f) [0x2ba9b7b15ebf]
5 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatenodedistgrid_+0xc1) [0x2ba9b7b61141]
6 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshaddelements_+0xbc0) [0x2ba9b881c880]
7 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromunstruct_+0x4d0f) [0x2ba9b88246cf]
8 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x270) [0x2ba9b881f270]
9 /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x15d8fd0]
10 /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x632341]
11 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc30) [0x2ba9b77436b0]
12 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ba9b773e913]
13 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ba9b7f7b9fb]
14 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ba9b7fa3bbe]
15 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ba9b773edd3]
16 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xa26) [0x2ba9b82d2c66]
17 /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ba9b85a5ede]
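The repeated "Caught signal 11" lines in the cesm.log excerpt above can be condensed when triaging larger runs. The sketch below is one possible way to do that, assuming the UCX-style line format seen in this log; summarize_segfaults and SAMPLE are hypothetical names introduced for illustration, and the sample holds only two of the lines from the excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
# [r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address
# not mapped to object at address 0x4f1b591)
SEGV_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+:\d+\] "
    r"Caught signal (?P<signal>\d+) "
    r"\(Segmentation fault: address not mapped to object "
    r"at address (?P<addr>0x[0-9a-fA-F]+)\)"
)


def summarize_segfaults(log_text):
    """Count distinct crashing processes, hosts, and faulting addresses."""
    pids = set()
    hosts = Counter()
    addresses = Counter()
    for match in SEGV_RE.finditer(log_text):
        pids.add((match["host"], match["pid"]))
        hosts[match["host"]] += 1
        addresses[match["addr"]] += 1
    return {
        "processes": len(pids),
        "hosts": dict(hosts),
        "addresses": dict(addresses),
    }


# Two lines copied from the log above, plus one unrelated line.
SAMPLE = """\
(t_initf) profile_papi_enable= F
[r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35003:0:35003] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
"""

if __name__ == "__main__":
    print(summarize_segfaults(SAMPLE))
```

In the excerpt above, every reported process faults at the same address (0x4f1b591), and the backtrace for tid 35022 passes through esmf_meshcreatefromfile in libesmf.so.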