Segfault thrown when Python is finalizing ESMF through pybind11 #748

Open
program-- opened this issue Feb 28, 2024 · 4 comments · May be fixed by #765
Labels: bug

Comments

program-- (Contributor) commented Feb 28, 2024

This segfault occurs at a call to ESMCI::VM::finalize(ESMC_Logical*, int*) at program termination. Part of the issue may be the atexit hooks that ESMF registers when its objects (Mesh, Grid, Field, etc.) are created: destruction does not happen in the "correct" order, and the finalization function attempts to dereference an object that no longer exists (speculation).

The location of the fault was found during the Forcings Engine integration work by enabling AddressSanitizer.

Potentially related to #470

program-- added the bug label on Feb 28, 2024
program-- (Contributor, Author) commented Feb 29, 2024

Interestingly, I changed the following:

  • Prevent mpi4py from calling Finalize (not sure if this had any effect, because mpi4py might not call it if MPI was initialized before loading it...; a sketch of this is at the end of this comment)
  • Ensure MPI_Init and MPI_Finalize are called correctly in the unit test (though the error below begs to differ)
  • Ensure the static construction order of the GIL and the DataProvider

and this left me with the error:

Abort(805361423) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
PMPI_Init(109): Cannot call MPI_INIT or MPI_INIT_THREAD more than once
LLVMSymbolizer: error reading file: No such file or directory

when running with MPI enabled. Running with MPI disabled gives me the same segfault as before; with MPI enabled, the errors above are printed but there is no segfault, and the unit test program exits normally...
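For reference, a minimal sketch of the first bullet above, assuming the embedded interpreter is already running; the helper name is illustrative and not part of ngen:

// Hypothetical helper: disable mpi4py's automatic MPI handling. This only has
// an effect if it runs before "from mpi4py import MPI" executes anywhere.
#include <pybind11/embed.h>
namespace py = pybind11;

void disable_mpi4py_finalize()
{
    py::object rc = py::module_::import("mpi4py").attr("rc");
    rc.attr("initialize") = false; // don't let mpi4py call MPI_Init on import
    rc.attr("finalize")   = false; // don't register an MPI_Finalize atexit hook
}

This relies on mpi4py honoring its mpi4py.rc settings, which may be why it had no visible effect here: as speculated above, mpi4py may already skip finalization when MPI was initialized before it was loaded.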

program-- (Contributor, Author) commented

After talking with @hellkite500 and @PhilMiller, I can confirm at the very least that this is due to an atexit hook created in Python (my guess is via esmpy) that uses MPI functions.

By changing the unit test tear down function to:

static void TearDownTestSuite()
{
    // Run every Python atexit hook (including the ones esmpy registers)
    // while the interpreter and MPI are both still alive.
    gil_->getModule("atexit").attr("_run_exitfuncs")();
    #if NGEN_WITH_MPI
    int mpi_final = 0;
    MPI_Finalized(&mpi_final);
    if (mpi_final == 0) {
        MPI_Finalize();
    }
    #endif
}

where I forcibly call all atexit functions registered in the Python interpreter, the program exits safely with no errors or segfaults.

PhilMiller (Contributor) commented

Working with Justin, we validated using the PMPI interfaces to 'detour' the MPI_Finalize call in ESMF.

The way this works is that we provide a 'profiling' implementation of MPI_Finalize in a small shared object. This implementation will actually be a no-op. ESMF's call to MPI_Finalize will resolve to call this implementation, and hence have no harmful side effects. Then, in our code (the tests and NGen.cpp), we explicitly call PMPI_Finalize as implemented by the MPI library, to actually shut down MPI.
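A minimal sketch of that shim, with an illustrative file name (the real names in the PR may differ):

// noop_finalize.cpp: compiled into a small shared object that is linked ahead
// of the MPI library, so ESMF's call to MPI_Finalize resolves to this no-op.
#include <mpi.h>

extern "C" int MPI_Finalize(void)
{
    // Intentionally does nothing: esmpy/ESMF may call MPI_Finalize from an
    // atexit hook after the rest of the program has already torn down.
    return MPI_SUCCESS;
}

The application then shuts MPI down itself through the profiling entry point, i.e. it calls PMPI_Finalize() (the MPI library's real implementation) at the end of NGen.cpp and in the test teardown.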

program-- (Contributor, Author) commented

As an update: esmf-org/esmf#234 was merged, which solves half of this issue (assuming the user has an ESMF build with that feature). The remaining half is ensuring destruction of the ESMF Grid/Mesh/Field objects used in the Forcings Engine Python code (NOAA-OWP/ngen-forcing#14).
