Parallel runs with routing crash #370

Closed
hellkite500 opened this issue Jan 28, 2022 · 8 comments · Fixed by #470

@hellkite500
Member

When running the framework in parallel with routing enabled, ngen crashes while trying to load the routing module. The crash occurs after the catchment formulations complete.

Current behavior

A segmentation fault occurs when pybind is used for the routing integration in a parallel run.

Expected behavior

The parallel formulation execution should finish, and rank 0 should initialize and execute the t-route routing module.

@hellkite500 hellkite500 self-assigned this Jan 28, 2022
@hellkite500
Member Author

hellkite500 commented Feb 4, 2022

example_crash_input.tar.gz

Attached here is a small example that reproduces the crash when ngen is built with gcc 8.3.1.
Build:

cmake3 -DQUIET:=On -DBMI_C_LIB_ACTIVE:=On -DNGEN_ACTIVATE_PYTHON:BOOL=ON -DNGEN_ACTIVATE_ROUTING:BOOL=ON -DMPI_ACTIVE:=On ..

Running with:
mpirun -n 3 ../ngen catchment_data.geojson '' nexus_data.geojson '' realization_config_short.json partitions.json

This causes ngen to crash after the catchment formulations have run, but before routing is launched.

@hellkite500
Member Author

Could NOT reproduce using gcc 6.3.1

@hellkite500
Member Author

Just to note, as I continue to investigate this issue, I have noticed that even in the gcc 8.3.1 environment, this crash is not deterministic. It behaves as if a pybind object is being cleaned up at the end of main, after the interpreter has already been finalized.

The seg fault does not happen on MPI rank 0, where the routing adapter is run, but instead comes from one of the other ranks. On these ranks, a pybind interpreter is used to run the model formulations. These should be independent interpreters, one per MPI process. I'm currently debugging the destructor chains to see if I can identify a scenario in which the interpreter isn't available during the destruction of a bound python object.
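
One way to picture the destructor-chain debugging described above is to have every destructor append to an event log and then look for releases that happen after the interpreter is gone. The following is only a minimal sketch of that idea: `TracedInterpreter`, `TracedHandle`, `late_releases`, and `simulate_bad_shutdown` are hypothetical names invented for illustration, and none of them are ngen or pybind11 types.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event log; in a real run this could be writes to stderr.
using EventLog = std::vector<std::string>;

// Stand-in for the embedded interpreter (NOT the pybind11 class itself).
struct TracedInterpreter {
    EventLog& log;
    explicit TracedInterpreter(EventLog& l) : log(l) { log.push_back("interpreter up"); }
    ~TracedInterpreter() { log.push_back("interpreter down"); }
};

// Stand-in for any object that wraps a Python reference.
struct TracedHandle {
    EventLog& log;
    std::string name;
    TracedHandle(EventLog& l, std::string n) : log(l), name(std::move(n)) {}
    ~TracedHandle() { log.push_back("release " + name); }
};

// Returns the names of handles that were released after the interpreter went down.
std::vector<std::string> late_releases(const EventLog& log) {
    std::vector<std::string> late;
    const std::string prefix = "release ";
    bool down = false;
    for (const auto& event : log) {
        if (event == "interpreter down") {
            down = true;
        } else if (down && event.compare(0, prefix.size(), prefix) == 0) {
            late.push_back(event.substr(prefix.size()));
        }
    }
    return late;
}

// Reproduces the suspected ordering with explicit resets (assumed scenario).
EventLog simulate_bad_shutdown() {
    EventLog log;
    auto interp = std::make_unique<TracedInterpreter>(log);
    auto early = std::make_unique<TracedHandle>(log, "formulation");
    auto late = std::make_unique<TracedHandle>(log, "cached_module");
    early.reset();   // released while the interpreter is still alive
    interp.reset();  // interpreter torn down
    late.reset();    // released too late: this is the suspicious case
    return log;
}
```

Calling `late_releases(simulate_bad_shutdown())` flags only the handle named `cached_module`, which is the kind of object a real trace would need to point at.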

@mattw-nws
Contributor

mattw-nws commented May 17, 2022

Completely unrelated execution (this was a non-parallel build and had no routing in the realization config), but I happened to get this error when trying to run valgrind for a completely different reason:

> Executing task: valgrind-debug: valgrind-debug <

Starting valgrind...
==9909== Memcheck, a memory error detector
==9909== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9909== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==9909== Command: ./cmake_build/ngen
==9909== 

==9909== (action at startup) vgdb me ... 
==9909== 
==9909== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==9909==   /path/to/gdb ./cmake_build/ngen
==9909== and then give GDB the following command
==9909==   target remote | /opt/rh/devtoolset-8/root/usr/lib64/valgrind/../../bin/vgdb --pid=9909
==9909== --pid is optional if only one valgrind process is running
==9909== 

==9909== 
==9909== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==9909==   /path/to/gdb ./cmake_build/ngen
==9909== and then give GDB the following command
==9909==   target remote | /opt/rh/devtoolset-8/root/usr/lib64/valgrind/../../bin/vgdb --pid=9909
==9909== --pid is optional if only one valgrind process is running
==9909== 

NGen Framework 0.1.0

==9909== Invalid read of size 4
==9909==    at 0x51690F2: PyObject_Free (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AE7AB: ??? (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AFB51: PyDict_SetItem (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51799FB: PyType_Ready (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5258030: _PyTypes_Init (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x52C5351: Py_InitializeFromConfig (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5165300: Py_InitializeEx (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)

==9909==    by 0x42F556: pybind11::initialize_interpreter(bool) (./extern/pybind11/include/pybind11/embed.h:107)
==9909==    by 0x42F826: pybind11::scoped_interpreter::scoped_interpreter(bool) (./extern/pybind11/include/pybind11/embed.h:184)
==9909==    by 0x4AFA61: void __gnu_cxx::new_allocator<pybind11::scoped_interpreter>::construct<pybind11::scoped_interpreter>(pybind11::scoped_interpreter*) (/opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:136)
==9909==    by 0x4A6246: void std::allocator_traits<std::allocator<pybind11::scoped_interpreter> >::construct<pybind11::scoped_interpreter>(std::allocator<pybind11::scoped_interpreter>&, pybind11::scoped_interpreter*) (/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h:475)
==9909==    by 0x49ADEA: std::_Sp_counted_ptr_inplace<pybind11::scoped_interpreter, std::allocator<pybind11::scoped_interpreter>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<>(std::allocator<pybind11::scoped_interpreter>) (/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/shared_ptr_base.h:545)
==9909==  Address 0x7781020 is 400 bytes inside a block of size 2,208 free'd
==9909==    at 0x4C2AFBD: free (/builddir/build/BUILD/valgrind-3.14.0/coregrind/m_replacemalloc/vg_replace_malloc.c:540)
==9909==    by 0x51AE7AB: ??? (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF13C: PyDict_SetDefault (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF21F: PyUnicode_InternInPlace (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF2B9: PyUnicode_InternFromString (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5277152: ??? (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x52771AE: PyDescr_NewMethod (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5179B46: PyType_Ready (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5258030: _PyTypes_Init (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x52C5351: Py_InitializeFromConfig (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5165300: Py_InitializeEx (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)

==9909==    by 0x42F556: pybind11::initialize_interpreter(bool) (./extern/pybind11/include/pybind11/embed.h:107)
==9909==  Block was alloc'd at
==9909==    at 0x4C29EC3: malloc (/builddir/build/BUILD/valgrind-3.14.0/coregrind/m_replacemalloc/vg_replace_malloc.c:309)
==9909==    by 0x5171367: PyObject_Malloc (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AE498: ??? (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF13C: PyDict_SetDefault (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF21F: PyUnicode_InternInPlace (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x51AF2B9: PyUnicode_InternFromString (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5276FE8: ??? (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5178B6F: PyType_Ready (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5258030: _PyTypes_Init (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x52C5351: Py_InitializeFromConfig (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)
==9909==    by 0x5165300: Py_InitializeEx (in /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0)

==9909==    by 0x42F556: pybind11::initialize_interpreter(bool) (./extern/pybind11/include/pybind11/embed.h:107)
==9909== 
==9909== (action on error) vgdb me ... 

Possibly related???

@mattw-nws
Contributor

mattw-nws commented Aug 24, 2022

Notably, I encountered some stability issues with routing and ngen/Python. These were on rank 0, but had to do with the HDF5 library: pytables brings along its own binary, and we were building ngen with another. I solved those issues by building pytables from source against the same libhdf5 as ngen. In theory, the same mismatch could cause an issue like this one on a non-rank-0 process if either pytables or HDF5 is used there. In any case, this and any other binary libraries that Python modules load through pybind, and that may not match the libraries loaded by ngen, should be examined in relation to this issue.

@hellkite500
Member Author

This may not be entirely parallel-related. A similar issue seems to have come up during calibration runs. It was reported as random seg faults during those runs, and the symptoms are eerily similar. Notes on a reported crash:

compiler: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
output log:

Finished 59161 timesteps.
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
'HDF5_DISABLE_VERSION_CHECK' environment variable is set to 1, application will
continue at your own risk.
Headers are 1.12.2, library is 1.10.4
	    SUMMARY OF THE HDF5 CONFIGURATION
	    =================================
General Information:
-------------------
                   HDF5 Version: 1.10.4
                  Configured on: Mon, 13 Apr 2020 12:15:08 +0000
                  Configured by: Debian
                    Host system: x86_64-pc-linux-gnu
              Uname information: Debian
                       Byte sex: little-endian
             Installation point: /usr
		    Flavor name: serial
Compiling Options:
------------------
                     Build Mode: production
              Debugging Symbols: no
                        Asserts: no
                      Profiling: no
             Optimization Level: high
Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables:
                        LDFLAGS: -Wl,-Bsymbolic-functions -Wl,-z,relro
                     H5_LDFLAGS: -Wl,--version-script,$(top_srcdir)/debian/map_serial.ver
                     AM_LDFLAGS:
                Extra libraries: -lpthread -lsz -lz -ldl -lm
                       Archiver: ar
                       AR_FLAGS: cr
                         Ranlib: x86_64-linux-gnu-ranlib
Languages:
----------
                              C: yes
                     C Compiler: /usr/bin/gcc
                       CPPFLAGS: -Wdate-time -D_FORTIFY_SOURCE=2
                    H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200112L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS:
                        C Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong -Wformat -Werror=format-security
                     H5 C Flags:  -std=c99  -pedantic -Wall -Wextra -Wbad-function-cast -Wc++-compat -Wcast-align -Wcast-qual -Wconversion -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-include-dirs -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wunused-macros -Wunsafe-loop-optimizations -Wwrite-strings -finline-functions -s -Wno-inline -Wno-aggregate-return -Wno-missing-format-attribute -Wno-missing-noreturn -O
                     AM C Flags:
               Shared C Library: yes
               Static C Library: yes
                        Fortran: yes
               Fortran Compiler: /usr/bin/gfortran
                  Fortran Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong
               H5 Fortran Flags:  -pedantic -Wall -Wextra -Wunderflow -Wimplicit-interface -Wsurprising -Wno-c-binding-type  -s -O2
               AM Fortran Flags:
         Shared Fortran Library: yes
         Static Fortran Library: yes
                            C++: yes
                   C++ Compiler: /usr/bin/g++
                      C++ Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong -Wformat -Werror=format-security
                   H5 C++ Flags:   -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wredundant-decls -Winline -Wsign-promo -Woverloaded-virtual -Wold-style-cast -Weffc++ -Wreorder -Wnon-virtual-dtor -Wctor-dtor-privacy -Wabi -finline-functions -s -O
                   AM C++ Flags:
             Shared C++ Library: yes
             Static C++ Library: yes
                           Java: yes
                  Java Compiler: /usr/bin/java (openjdk 11.0.7-ea 2020-04-14)
Features:
---------
                   Parallel HDF5: no
Parallel Filtered Dataset Writes: no
              Large Parallel I/O: no
              High-level library: yes
                    Threadsafety: yes
             Default API mapping: v18
  With deprecated public symbols: yes
          I/O filters (external): deflate(zlib),szip(encoder)
                             MPE: no
                      Direct VFD: no
                         dmalloc: no
  Packages w/ extra debug output: none
                     API tracing: no
            Using memory checker: no
 Memory allocation sanity checks: no
             Metadata trace file: no
          Function stack tracing: no
       Strict file format checks: no
    Optimization instrumentation: no
Finished routing
/home/west/git_repositories/ngen_10242022/ngen/venv/lib/python3.8/site-packages/h5py/__init__.py:36: UserWarning: h5py is running against HDF5 1.10.4 when it was built against 1.12.2, this may cause problems
  _warn(("h5py is running against HDF5 {0} when it was built against {1}, "
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:597: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:601: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:597: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:601: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/nwm_routing/src/nwm_routing/__main__.py:566: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->axis0] [items->None]
  flowveldepth.loc[csv_output_segments].to_hdf(output_path.joinpath(filename_fvd), key="qvd")
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/nwm_routing/src/nwm_routing/__main__.py:566: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_items] [items->None]
  flowveldepth.loc[csv_output_segments].to_hdf(output_path.joinpath(filename_fvd), key="qvd")
creating supernetwork connections set
supernetwork connections set complete
... in 0.007495403289794922 seconds.
setting channel initial states ...
channel initial states complete
... in 1.430511474609375e-06 seconds.
creating qlateral array ...
qlateral array complete
... in 61.56017208099365 seconds.
WARNING: Lateral flow time series is larger than provided nts. Adjusting nts.
If this was unintended, double check the configuration number of time steps and the lateral flow input time series
executing routing computation ...
JIT Preprocessing time 5.0067901611328125e-05 seconds.
starting Parallel JIT calculation
PARALLEL TIME 0.4558742046356201 seconds.
ordered reach computation complete
... in 0.45807576179504395 seconds.
Handling output ...
- writing flow, velocity, and depth results to .csv
output complete
... in 4.025044918060303 seconds.
process complete
66.17824506759644 seconds.
Segmentation fault (core dumped)

In the output, you can see that routing has finished ("process complete"), so I'm assuming control passed back to ngen at that point and the segmentation fault occurs while finishing the main loop, which involves destructing the python interpreter.

This is very hard to debug with certainty, as it may take hundreds of executions to reproduce this error (the calibration runs were at 300+ iterations when this randomly occurred).

The difference between this serial run and the parallel run is that, in parallel, the "non-routing" ranks cause the error, so routing never finishes. When this is triggered in serial, routing is able to finish before the seg fault occurs in the destruction chain.

As @mattw-nws noted, this MAY be related to modules that were built against one binary version and link against a mismatched version at runtime in the embedded interpreter. I'm not sure what that would do in the destructor chain here.

@hellkite500 hellkite500 mentioned this issue Nov 8, 2022
12 tasks
@hellkite500
Member Author

I was also finally able to reproduce this on serial runs using

Apple clang version 14.0.0 (clang-1400.0.29.102)
Target: arm64-apple-darwin21.6.0

It is indeed an issue in the order of destruction of resources: the python interpreter is shut down before the modules/objects held locally by the utility are destroyed. I'm pretty sure this is due to the static singleton use, since the destruction order of static variables across compilation units is not well defined.
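
The ordering problem described here, and one common way to guard against it, can be sketched without any Python at all. This is an illustration under stated assumptions, not the change made in #470: `FakeInterpreter`, `FakeHandle`, `SafeHolder`, and the two `run_*` functions are hypothetical stand-ins, and the "fix" shown (having the holder share ownership of the interpreter so it cannot be finalized first) is just one possible approach.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an embedded interpreter with a global "alive" flag.
struct FakeInterpreter {
    static bool& alive() { static bool flag = false; return flag; }
    FakeInterpreter() { alive() = true; }
    ~FakeInterpreter() { alive() = false; }
};

// Hypothetical stand-in for a bound Python object; a real one would need
// a live interpreter when it is released.
struct FakeHandle {
    std::vector<std::string>* log;
    explicit FakeHandle(std::vector<std::string>* l) : log(l) {}
    ~FakeHandle() {
        log->push_back(FakeInterpreter::alive() ? "released safely"
                                                : "released after shutdown");
    }
};

// Holder that keeps the interpreter alive for as long as its handle exists.
// Members are destroyed in reverse declaration order: handle first, then interp.
struct SafeHolder {
    std::shared_ptr<FakeInterpreter> interp;
    FakeHandle handle;
    SafeHolder(std::shared_ptr<FakeInterpreter> i, std::vector<std::string>* l)
        : interp(std::move(i)), handle(l) {}
};

// Mimics the failing order: interpreter torn down before the handle.
std::vector<std::string> run_unsafe_order() {
    std::vector<std::string> log;
    auto interp = std::make_unique<FakeInterpreter>();
    auto handle = std::make_unique<FakeHandle>(&log);
    interp.reset();  // interpreter goes away first
    handle.reset();  // handle is released afterwards
    return log;
}

// Same teardown calls, but the holder shares ownership of the interpreter.
std::vector<std::string> run_safe_order() {
    std::vector<std::string> log;
    auto interp = std::make_shared<FakeInterpreter>();
    auto holder = std::make_unique<SafeHolder>(interp, &log);
    interp.reset();  // caller drops its reference; the holder still owns one
    holder.reset();  // handle destroyed first, interpreter last
    return log;
}
```

In this sketch, `run_unsafe_order()` records a release after shutdown, while `run_safe_order()` records a safe release even though the caller drops its interpreter reference first. The design choice is to tie the interpreter's lifetime to the objects that depend on it rather than to rely on the order in which statics happen to be destroyed.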

I have a fix that I should get pushed to a PR soon that at least resolves the persistent seg fault I was able to produce locally. Will need to test that fix on the reproducible example above using the known failing compiler configuration.

hellkite500 added a commit to hellkite500/ngen that referenced this issue Nov 8, 2022
@hellkite500 hellkite500 mentioned this issue Nov 8, 2022
12 tasks
hellkite500 added a commit to hellkite500/ngen that referenced this issue Nov 9, 2022
hellkite500 added a commit to hellkite500/ngen that referenced this issue Nov 9, 2022
@hellkite500
Member Author

Tested #470 on gcc 8.3.1 in parallel, and I no longer get crashes on non-routing ranks after they finish running catchments and shut down.
