pmix dstore esh bus error in orted on cray in dvm mode #2737

Closed
@marksantcroos

With the latest master in DVM mode, orted repeatedly crashes with a bus error after running roughly a couple of thousand tasks. The core file gives the following backtrace:

Core was generated by `orted -mca orte_debug "1" -mca orte_debug_daemons "1" --hnp-topo-sig 0N:1S:1L3:'.
Program terminated with signal 7, Bus error.
(gdb) bt
#0  0x00002aaaadcaa85b in __memset_sse2 () from /lib64/libc.so.6
#1  0x00002aaaac2b2718 in _create_new_segment (type=NS_META_SEGMENT, ns_map=0x2aaaae3a76d0, id=0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:1456
#2  0x00002aaaac2b2bcb in _update_ns_elem (ns_elem=0x2aaab3b3bde0, info=0x2aaaae3a76d0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:1550
#3  0x00002aaaac2afce1 in _esh_store (nspace=0xe2fb2c8 "528289639", rank=4294967294, kv=0xdf52210)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:918
#4  0x00002aaaac2ad2cb in pmix_dstore_store (nspace=0xe2fb2c8 "528289639", rank=4294967294, kv=0xdf52210)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_dstore.c:66
#5  0x00002aaaac28fa9d in _rank_key_dstore_store (cbdata=0xe01e8c0) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:96
#6  0x00002aaaac2917a7 in _job_data_store (nspace=0xd1aca68 "528289639", cbdata=0xe01e8c0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:386
#7  0x00002aaaac28fc2c in pmix_job_data_dstore_store (nspace=0xd1aca68 "528289639", bptr=0xe01cf80)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:118
#8  0x00002aaaac2658ff in _register_nspace (sd=-1, args=4, cbdata=0xd1ac9b0) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c:459
#9  0x00002aaaab1e4151 in event_process_active_single_queue (activeq=0x7079c0, base=0x707730)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1370
#10 event_process_active (base=<optimized out>) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1440
#11 opal_libevent2022_event_base_loop (base=0x707730, flags=1) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1644
#12 0x00002aaaac2b7efa in progress_engine (obj=0x7072d8) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/runtime/pmix_progress_threads.c:149
#13 0x00002aaaada0e806 in start_thread () from /lib64/libpthread.so.0
#14 0x00002aaaadd029bd in clone () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()

I will dig further, but I am posting this now to get more eyes on it.
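
A note for anyone trying to reproduce this: frame #0 is a `memset` called from `_create_new_segment`, and a bus error on the first write to a freshly mapped file is what Linux delivers when the backing file cannot supply a page (for example, when the file system holding it is full). Whether that is what happens here is an assumption; the trace alone does not confirm it, and I have not checked how `pmix_esh.c` sizes or maps its segments. The sketch below is not PMIx code. `create_segment` is a hypothetical helper that only illustrates how reserving the blocks with `posix_fallocate` before mapping turns that kind of failure into an error return instead of a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from PMIx): create a file-backed shared segment
 * of `size` bytes at `path`, zero it, and return the mapping, or NULL on
 * failure. */
static void *create_segment(const char *path, size_t size)
{
    /* posix_fallocate() rejects a zero length, so require size > 0. */
    assert(size > 0);

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    /* ftruncate() would only set the file length and may leave a sparse
     * file whose blocks are allocated on first touch. posix_fallocate()
     * reserves them now and returns an error number (such as ENOSPC)
     * rather than setting errno. */
    int rc = posix_fallocate(fd, 0, (off_t)size);
    if (rc != 0) {
        fprintf(stderr, "posix_fallocate(%s): %s\n", path, strerror(rc));
        close(fd);
        unlink(path);
        return NULL;
    }

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        perror("mmap");
        unlink(path);
        return NULL;
    }

    /* The backing blocks are already reserved, so this write should not
     * fault for lack of space. */
    memset(addr, 0, size);
    return addr;
}
```

If a check like this reports `ENOSPC` on the node where orted dies, the next thing to look at would be how much space is left in the directory that holds the dstore segment files after a couple of thousand tasks.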
