Failure to create lnd_mesh.nc file from global 0.01x0.01 deg SCRIP file #2479

Open
olyson opened this issue Apr 18, 2024 · 3 comments
Labels: priority: low (background task that doesn't need to be done right away); test: mksurfdata (test mksurfdata_esmf before merging)

Contributor

olyson commented Apr 18, 2024

Brief summary of bug

Failure to create lnd_mesh.nc file from global 0.01x0.01 deg SCRIP file

General bug information

CTSM version you are using: NA

Does this bug cause significantly incorrect results in the model's science? No

Details of bug

As part of the steps required to create a global 0.01x0.01 surface dataset, one step that fails on Derecho is:

/glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

This fails immediately with "killed".

Tried:

qcmd -- /glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

For reference, the qcmd specifics are: qsub -l select=1:ncpus=32:mem=55GB -A P93300641 -q develop@desched1 -l walltime=01:00:00

This fails with:

Segmentation fault (core dumped)

Tried:

qsub -l select=1:ncpus=1:mem=235GB -q develop -A P93300641 -l walltime=01:00:00 -- /glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

This fails with:

/glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory

I've tried several other combinations of approaches on both Derecho and Casper, and none have worked.

I'll note that the creation of a 0.05x0.05 deg global lnd_mesh file does work, and the rest of the process to create a 0.05x0.05 surface dataset does work.
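
To put the two resolutions side by side, here is a rough size estimate for the main SCRIP arrays. This is only a sketch: it assumes 4 corners per cell, 8-byte doubles for the center and corner coordinates, and a 4-byte integer mask, none of which is stated above. The function name scrip_bytes is made up for illustration.

```shell
#!/usr/bin/env bash
# Rough size estimate for the large arrays in a SCRIP grid file.
# Assumptions (not stated in the issue): 4 corners per cell, 8-byte doubles
# for center/corner lat/lon, and a 4-byte integer mask.

scrip_bytes() {
  local nlon=$1 nlat=$2
  local ncells=$(( nlon * nlat ))
  local centers=$(( ncells * 2 * 8 ))      # grid_center_lat + grid_center_lon
  local corners=$(( ncells * 4 * 2 * 8 ))  # grid_corner_lat + grid_corner_lon
  local mask=$(( ncells * 4 ))             # grid_imask
  echo $(( centers + corners + mask ))
}

fine=$(scrip_bytes 36000 18000)    # 0.01 x 0.01 deg
coarse=$(scrip_bytes 7200 3600)    # 0.05 x 0.05 deg

echo "0.01 deg: $(( 36000 * 18000 )) cells, about $(( fine / 1000000000 )) GB"
echo "0.05 deg: $(( 7200 * 3600 )) cells, about $(( coarse / 1000000000 )) GB"
echo "ratio: $(( fine / coarse ))x"
```

Under those assumptions the 0.01 deg grid has 648,000,000 cells and roughly 54 GB of coordinate data, 25 times the 0.05 deg grid. That is only the input arrays; whatever the converter allocates on top of that is not counted here.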

olyson added the "tag: support tools only" and "next" (this should get some attention in the next week or two; normally at each Thursday SE meeting) labels Apr 18, 2024
olyson self-assigned this Apr 18, 2024
Collaborator

ekluzek commented Apr 18, 2024

I have one thing to try, @olyson. It looks like ESMF_Scrip2Unstruct IS a parallel program. So if you set up a batch job for it and give it multiple processors (and use a version of ESMF that is built with a full MPI library), this likely WILL work.

The version you have above did link in a full MPI library. So you should just be able to modify the batch commands to use multiple processors, and to add "mpibind" in front of the call to ESMF_Scrip2Unstruct.
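
As a minimal sketch of that suggestion, the snippet below writes (but does not submit) a PBS job script that runs the same executable under mpibind on several ranks. The node count, ranks per node, memory request, job file name, and the `<project>` placeholder are all hypothetical; the queue, walltime, and paths are copied from the commands earlier in this issue. Whether mpibind needs extra arguments here is not something this thread confirms.

```shell
#!/usr/bin/env bash
# Write (but do not submit) a PBS job script that runs ESMF_Scrip2Unstruct
# under mpibind on several MPI ranks.
# Hypothetical values: NODES, RANKS_PER_NODE, MEM, the job file name, and the
# <project> placeholder. Queue, walltime, and paths come from the issue text.

ESMF_BIN=/glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin
SCRIP=/glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc
MESH=lnd_mesh.nc
NODES=4
RANKS_PER_NODE=16
MEM=235GB

write_job_script() {
  local out=$1
  # Unquoted heredoc: ${...} expands now, \$ stays literal for the job itself.
  cat > "$out" <<EOF
#!/bin/bash
#PBS -N scrip2unstruct
#PBS -A <project>
#PBS -q develop
#PBS -l select=${NODES}:ncpus=${RANKS_PER_NODE}:mpiprocs=${RANKS_PER_NODE}:mem=${MEM}
#PBS -l walltime=01:00:00

cd "\$PBS_O_WORKDIR"
mpibind ${ESMF_BIN}/ESMF_Scrip2Unstruct ${SCRIP} ${MESH} 0
EOF
}

write_job_script submit_scrip2unstruct.pbs
cat submit_scrip2unstruct.pbs
```

The generated file would then be submitted with qsub after filling in the project code. Using fewer ranks per node than there are cores is a deliberate choice in this sketch, so that each rank has more memory available.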

If it doesn't, we can point the ESMF people to it. But 0.05 degrees also seems acceptable for most uses...

Contributor Author

olyson commented Apr 22, 2024

Thanks for the suggestion, @ekluzek. I made a batch script which works for 0.1x0.1 and 0.05x0.05, but 0.01x0.01 fails. I've tried various combinations of nodes, cores, and memory. Most of the time there are no errors in either the PET logs or stderr/stdout. Occasionally I get an error like this in stdout:

dec1296.hsn.de.hpc.ucar.edu: rank -1 died from signal 9
dec1281.hsn.de.hpc.ucar.edu: rank 1 died from signal 15

On a side note, I use mkscripgrid.ncl to create the SCRIP file. I had to add the following options to allow the larger file size:

Opt@LargeFile = True
Opt@NetCDFType = "netcdf4"

which creates a netcdf4 file. I tried to use nccopy to convert it to cdf5 but got this error:

NetCDF: Not a valid data type or _FillValue type mismatch

I eventually found that nccopy did not like the fact that "string" was prepended to the global attributes in the file, e.g.,

string :Createdby = "ESMF_regridding.ncl"

I deleted the global attributes and then nccopy worked to create a cdf5 file.
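
For anyone repeating this, the two steps could look roughly like the dry-run sketch below, which only prints the commands. It assumes NCO's ncatted is used to delete the global attributes (the comment above does not say which tool was used), and the file names and the function name cdf5_commands are hypothetical.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the two commands that would strip global attributes
# and convert a netCDF-4 file to CDF5. Nothing is executed.
# Assumption: NCO's ncatted is the tool used to delete the attributes;
# the issue does not say how they were removed. File names are hypothetical.

cdf5_commands() {
  local src=$1 stripped=$2 out=$3
  # Empty attribute name + "global" + mode "d" deletes every global attribute.
  echo "ncatted -O -a ,global,d,, $src $stripped"
  # -k cdf5 selects the CDF5 (64-bit data) output format.
  echo "nccopy -k cdf5 $stripped $out"
}

cdf5_commands SCRIPgrid_nc4.nc SCRIPgrid_noatts.nc SCRIPgrid_cdf5.nc
```

Printing the commands first makes it easy to check the file names before running anything against a file of this size.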

For future reference, my script is here:

/glade/work/oleson/release-cesm2.1.3/components/clm/tools/mkmapgrids/submit_Scrip2Unstruct_derecho.csh

Collaborator

ekluzek commented Apr 22, 2024

OK, I'm going to point the ESMF people to this.

wwieder added the "priority: low" (background task that doesn't need to be done right away) label and removed the "next" label Apr 25, 2024
samsrabin added the "test: mksurfdata" (test mksurfdata_esmf before merging) label and removed the "support tools only" label Oct 3, 2024