Crash of OMIP2 simulation using TL319_tn14 grid and noresm2_5_alpha04_v3 #387

Open
matsbn commented Aug 29, 2024

Using branch feature/noresm2_5_alpha04_v3 of https://github.com/mvertens/NorESM.git, compset 2000_DATM%JRA_SLND_CICE_BLOM_DROF%JRA_SGLC_SWAV_SESP, and grid combination TL319_tn14, the simulation crashed at what seemed to be the first attempt to write CICE diagnostics.

The error message in cesm.log.* was:

```
[b4167:12257] *** An error occurred in MPI_Gather
[b4167:12257] *** reported by process [47501952548864,123]
[b4167:12257] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
[b4167:12257] *** MPI_ERR_TRUNCATE: message truncated
[b4167:12257] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b4167:12257] *** and potentially your MPI job)
```

I had a feeling it could have something to do with Lustre (LFS) stripes, and I saw that in env_run.xml, PIO_STRIDE was set to $MAX_MPITASKS_PER_NODE. This is 128 on Betzy, but I was running CICE with fewer processors than this (96). When I manually set PIO_STRIDE to 8 for all components (a somewhat arbitrary choice), the simulation ran fine. I am not sure this is the reason for the crash, but if it is, maybe PIO_STRIDE should be set to the minimum of $MAX_MPITASKS_PER_NODE and the number of processors per component, as in the sketch below?
