pmix dstore esh bus error in orted on cray in dvm mode #2737

Closed
marksantcroos opened this issue Jan 16, 2017 · 40 comments

@marksantcroos (Contributor)

With the latest master, in DVM mode, I repeatedly run into the following after running a couple of thousand tasks:

Core was generated by `orted -mca orte_debug "1" -mca orte_debug_daemons "1" --hnp-topo-sig 0N:1S:1L3:'.
Program terminated with signal 7, Bus error.
(gdb) bt
#0  0x00002aaaadcaa85b in __memset_sse2 () from /lib64/libc.so.6
#1  0x00002aaaac2b2718 in _create_new_segment (type=NS_META_SEGMENT, ns_map=0x2aaaae3a76d0, id=0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:1456
#2  0x00002aaaac2b2bcb in _update_ns_elem (ns_elem=0x2aaab3b3bde0, info=0x2aaaae3a76d0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:1550
#3  0x00002aaaac2afce1 in _esh_store (nspace=0xe2fb2c8 "528289639", rank=4294967294, kv=0xdf52210)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:918
#4  0x00002aaaac2ad2cb in pmix_dstore_store (nspace=0xe2fb2c8 "528289639", rank=4294967294, kv=0xdf52210)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_dstore.c:66
#5  0x00002aaaac28fa9d in _rank_key_dstore_store (cbdata=0xe01e8c0) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:96
#6  0x00002aaaac2917a7 in _job_data_store (nspace=0xd1aca68 "528289639", cbdata=0xe01e8c0)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:386
#7  0x00002aaaac28fc2c in pmix_job_data_dstore_store (nspace=0xd1aca68 "528289639", bptr=0xe01cf80)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c:118
#8  0x00002aaaac2658ff in _register_nspace (sd=-1, args=4, cbdata=0xd1ac9b0) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c:459
#9  0x00002aaaab1e4151 in event_process_active_single_queue (activeq=0x7079c0, base=0x707730)
    at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1370
#10 event_process_active (base=<optimized out>) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1440
#11 opal_libevent2022_event_base_loop (base=0x707730, flags=1) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/event/libevent2022/libevent/event.c:1644
#12 0x00002aaaac2b7efa in progress_engine (obj=0x7072d8) at /ccs/home/marksant1/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/runtime/pmix_progress_threads.c:149
#13 0x00002aaaada0e806 in start_thread () from /lib64/libpthread.so.0
#14 0x00002aaaadd029bd in clone () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()

Will dig further, but increasing the set of eyes looking at it.

@rhc54 (Contributor) commented Jan 16, 2017

@marksantcroos Try configuring with --disable-pmix-dstore. My best guess is that the pmix shared memory system isn't cleaning up properly after each job and so you are eventually running out of shared memory space.

@marksantcroos (Contributor, Author)

Thanks, will try. That actually sounds promising, as I get the same error immediately if I restart the DVM!

@rhc54 (Contributor) commented Jan 16, 2017

Check your /tmp area - you may have to rm -rf some session dir(s) to clean up. Look for anything that starts with ompi or pmix.

@marksantcroos (Contributor, Author)

Yes, I noticed that as well. But I already run out of them within a single run.

@marksantcroos (Contributor, Author)

With --disable-pmix-dstore the issue is indeed gone. It is also much, much faster! What do I lose by not using the pmix dstore?

@rhc54 (Contributor) commented Jan 16, 2017

Your memory footprint grows a little, but that's all. However, dstore should have been faster, not slower - we'll have to investigate.

@rhc54 (Contributor) commented Jan 17, 2017

@marksantcroos Can you tell if it slows down as the number of jobs grows? I'm wondering if it is faster at first, but then slows down - this could potentially be a consequence of the cleanup problem.

@ggouaillardet (Contributor)

@marksantcroos @rhc54 I previously noticed the same odd behavior.

The root cause was that /dev/shm was full; for some reason both ftruncate() and mmap() succeed (at least on Linux), but a crash occurs when the mmap'ed memory is first accessed.

The attached program can be used to demonstrate this behavior:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#ifndef ROOT
#define ROOT "/mnt"
#endif

#define SZE 4096
char buffer[SZE];

int main(int argc, char *argv[]) {
    char * filename = ROOT "/test";
    int fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0600);
    void *p;
    if (0 > fd) {
        perror("open ");
        exit(1);
    }
    if (0 > ftruncate(fd, SZE)) {
        perror("ftruncate ");
        exit(1);
    }
#if 0
    if (0 > write(fd, buffer, SZE)) {
        perror("write ");
        exit(1);
    }
#endif
    if (MAP_FAILED == (p = mmap(NULL, SZE,
                                PROT_READ | PROT_WRITE, MAP_SHARED,
                                fd, 0))) {
        perror("mmap ");
        exit(1);
    }
    printf("mmap'ed %d bytes at %p\n", SZE, p);
    memset(p, 0, SZE);
    printf("memset'ed %d mmap'ed bytes at %p\n", SZE, p);
    munmap(p, SZE);
    close(fd);
    unlink(filename);
    return 0;
}

then

# mount -t tmpfs -o size=1M none /mnt
# dd if=/dev/zero of=/mnt/big
# ./mmap

This causes a crash in memset().

One way to detect (or work around) this is to write() the full file after ftruncate() and before mmap().
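
For reference, here is a minimal sketch of that write-through check (ensure_backing is a hypothetical helper name, not part of PMIx or the test program above): after ftruncate(), force the filesystem to actually allocate backing for the requested range, so a full /dev/shm shows up as a write() error instead of a SIGBUS on first touch.

#include <string.h>
#include <unistd.h>

/* Sketch only: make sure the file has real backing store before mmap().
 * ftruncate() only extends the file size; on a full tmpfs the pages may
 * not be allocatable, and the first touch then raises SIGBUS. */
static int ensure_backing(int fd, size_t size)
{
    char zeros[4096];
    size_t written = 0;

    memset(zeros, 0, sizeof(zeros));
    while (written < size) {
        size_t chunk = size - written;
        if (chunk > sizeof(zeros)) {
            chunk = sizeof(zeros);
        }
        ssize_t rc = pwrite(fd, zeros, chunk, (off_t)written);
        if (0 > rc) {
            return -1;          /* e.g. ENOSPC when /dev/shm is full */
        }
        written += (size_t)rc;
    }
    return 0;                   /* safe to mmap() and touch the pages */
}

posix_fallocate(fd, 0, size) should report ENOSPC in the same situation without an explicit write loop, though behavior can vary across filesystems.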

@rhc54 (Contributor) commented Jan 17, 2017

@artpol84 @karasevb Can you guys take a look at this problem?

@marksantcroos (Contributor, Author)

> @marksantcroos Can you tell if it slows down as the number of jobs grows? I'm wondering if it is faster at first, but then slows down - this could potentially be a consequence of the cleanup problem.

Now that it works again I will go back to data gathering :-) Will report back.

@artpol84 (Contributor)

OK, we will check this out. I think I was observing this on my virtual cluster built on Linux containers, but @karasevb was unable to reproduce it. We will take a closer look now.
@ggouaillardet thanks for the analysis!

@artpol84 (Contributor)

@marksantcroos, am I understanding you correctly that the repeated failures were caused by the pile of dstore-related leftovers in the /tmp directory?

@marksantcroos (Contributor, Author)

@artpol84 Not exactly, let me explain my use case. I use a persistent DVM (which I start with orte-dvm) and then launch many (often small and short) tasks against it. So they were not "leftovers" from previous runs, but files from the same DVM instance, left behind by completed tasks. Does that make sense?

@artpol84 (Contributor)

@marksantcroos, thank you - it does make sense. We haven't checked this use case, but we will do that now.
@rhc54 @marksantcroos can you provide short guidance on how to use DVM mode, to speed up the debugging process? Maybe an anonymized script snippet?

@ggouaillardet (Contributor)

@marksantcroos when you are running your test, can you monitor /dev/shm and see how the free space evolves?

With plain mpirun, I run into this issue with a tiny /dev/shm when invoking MPI_Comm_spawn multiple times within the same mpirun command.
The use of a persistent DVM involves creating multiple PMIx jobs, and I am afraid the shared dstore grows forever and is only cleaned up when the mpirun or persistent DVM completes.

FWIW, I already plugged numerous memory leaks in https://github.com/ggouaillardet/ompi/tree/topic/finalize_leaks
I still need to clean this up, issue a separate PR for the pmix2x-specific leaks, and merge it into master.

@marksantcroos (Contributor, Author)

@artpol84 Sure, it's rather straightforward; the minimal use case is as follows:

orte-dvm --report-uri dvm_uri &
[ wait until "DVM ready" message ...]
for i in `seq 42`; do orterun --hnp file:dvm_uri -np 1 /bin/date; done

In general I'm happy to make people aware of our usage mode so that it gets considered earlier ;-)

@artpol84 (Contributor)

@ggouaillardet dstore was redesigned to support cleanup at job termination, so this case is supposed to be handled correctly. However, there might be an implementation issue. We are checking.

@marksantcroos (Contributor, Author)

@ggouaillardet When I monitor /tmp/scratch/ompi*/jf.*/pmix_dstor*/ on the compute node, I see the initial-pmix_shared-segment* entries increasing over time; they only get removed once I close the DVM.

total 8332
drwx------ 2 12063 22864     800 Jan 17 03:08 .
drwx------ 3 12063 22864      60 Jan 17 03:06 ..
-rw------- 1 12063 22864       0 Jan 17 03:06 dstore_sm.lock
-rw------- 1 12063 22864    4096 Jan 17 03:06 initial-pmix_shared-segment-0
-rw------- 1 12063 22864    4096 Jan 17 03:06 initial-pmix_shared-segment-1
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-10
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-11
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-12
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-13
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-14
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-15
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-16
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-17
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-18
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-19
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-2
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-20
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-21
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-22
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-23
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-24
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-25
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-26
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-27
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-28
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-29
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-3
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-30
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-31
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-32
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-33
-rw------- 1 12063 22864    4096 Jan 17 03:08 initial-pmix_shared-segment-34
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-4
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-5
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-6
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-7
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-8
-rw------- 1 12063 22864    4096 Jan 17 03:07 initial-pmix_shared-segment-9
-rw------- 1 12063 22864 4194304 Jan 17 03:08 smdataseg-633668069-0
-rw------- 1 12063 22864 4194304 Jan 17 03:08 smseg-633668069-0

@artpol84 (Contributor)

@marksantcroos this is good input, thank you. @karasevb please reproduce and work to resolve it.

@marksantcroos can you also tell us how you measured that it is "much much" faster without dstore? Is this a visual observation, or do you measure the performance somehow? I guess (as @rhc54 already mentioned) that if memory gets filled up you may see a slowdown because of that. But I'd still like to know how you evaluate performance so we can reproduce it. Once it is fixed, will it be possible to re-evaluate in your environment?

@artpol84 (Contributor)

@marksantcroos also what is your codebase? master or v2.x?

@marksantcroos (Contributor, Author)

@artpol84 This is with the latest master. The performance observation was perception-based; I will actually measure to see whether it was correct.

@artpol84 (Contributor)

@marksantcroos thank you.
If you confirm the performance degradation with dstore, could you please also check, as @rhc54 suggested, the case with a small number of tasks versus the case with a lot of them, to see whether the problem is caused by memory filling up or not.

@marksantcroos (Contributor, Author)

[attached plot: execution time of each 100-task batch per consecutive run, DSTORE vs NO-DSTORE]

@marksantcroos (Contributor, Author)

Both the DSTORE and NO-DSTORE runs started with a fresh DVM. Both increase gradually. The offset confirms my perception from yesterday.

@rhc54 (Contributor) commented Jan 17, 2017

@marksantcroos Can you please confirm your environment for us? Was this done on Titan?

@marksantcroos (Contributor, Author)

@rhc54 Yes, this was on Titan.

@artpol84 (Contributor)

@marksantcroos can you explain a bit more about your plot?
What does "consecutive run" mean?

@marksantcroos (Contributor, Author)

@artpol84 I started the DVM once for each configuration and then repeatedly ran my program, which fires 100 tasks at the DVM (repetitions on the X-axis). I measure how long it takes to execute these 100 /bin/date tasks, and that's what you see on the Y-axis.

@rhc54 (Contributor) commented Jan 17, 2017

@artpol84 We discussed this a bit on the call this morning, so let me capture those thoughts here. The workload in this use case is very different from the ones we normally encounter. Instead of one large job with lots of procs per node, this workload has many small jobs consisting of only a few (often just one) processes.

In this use case, there really isn't going to be any benefit from dstore because there aren't multiple procs per node sharing the information. So we see the overhead of creating all these shared memory segments, but get no benefit from them.

I would therefore not worry about the performance difference here. We just need to document that dstore should be disabled for this type of workload, and make it possible to do so via an MCA param at runtime instead of during configure (just to make life easier for users). We'll work on that over in the PMIx side of the house.

I think the one thing that does, however, need addressing here is the cleanup problem as that can/will impact long-running RMs.

@artpol84 (Contributor)

@marksantcroos I see, thank you. Out of curiosity - what happens if you vary the number of /bin/true runs, e.g. from 2 to 128?

@rhc54 thanks, this makes sense.
We currently have a partially introduced "session" notion in the dstore. Right now all tasks from the same userid on the persistent orted are counted as belonging to the same session - this is why we have this leftover.
However, sessions were originally intended for cases such as MPI_Comm_spawn or MPI_Comm_connect/accept, where multiple jobs are related and need to communicate. So, with a little help from the RM, we can force the dstore to create independent sessions for each independent task, as in the case @marksantcroos is exercising. The dstore would then clean up the segments once all the namespaces in a session have been deregistered.
What we need from the RM is a session number passed through the info keys at PMIx_Server_register_nspace; the dstore will use it to determine whether any two namespaces are related. If I recall correctly, in OMPI this could be the job family number. I'd prefer this session ID to be an integer, but if there are concerns it can be a string as well.
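
To make that concrete, here is a minimal sketch (not OMPI's actual implementation) of what the RM side could look like, assuming such a key is eventually defined. The key name "pmix.dstore.sessid" and the register_job/regcbfunc helpers are hypothetical and only for illustration; PMIx_server_register_nspace and PMIX_INFO_LOAD are the existing PMIx server API.

#include <stddef.h>
#include <stdint.h>
#include <pmix_server.h>

/* keep the info valid until the callback fires - registration may complete asynchronously */
static pmix_info_t session_info[1];

static void regcbfunc(pmix_status_t status, void *cbdata)
{
    (void)status;
    (void)cbdata;   /* a real server would release a lock / notify the caller here */
}

static pmix_status_t register_job(const char *nspace, int nlocalprocs,
                                  uint32_t session_id)
{
    /* hypothetical key: tells the dstore which session this nspace belongs to,
     * so its segments can be reclaimed once that session's namespaces are gone */
    PMIX_INFO_LOAD(&session_info[0], "pmix.dstore.sessid",
                   &session_id, PMIX_UINT32);

    return PMIx_server_register_nspace(nspace, nlocalprocs,
                                       session_info, 1, regcbfunc, NULL);
}

If the key is absent, the dstore would fall back to the reserved default session discussed below.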

@rhc54 (Contributor) commented Jan 17, 2017

Sounds reasonable to me - I see no problem providing an integer or string, so we can pick whatever you like. We just need to add it to the list of RM-required data, and then protect ourselves by defaulting to a single session if it isn't provided.

@artpol84 (Contributor)

Well, I think it's OK to default to a single session, as some RMs will support only that - SLURM, for example. But if a particular RM knows that there will be multiple independent jobs, it is the RM's responsibility to manage the session IDs and provide them to PMIx_Server_register_nspace().

@artpol84 (Contributor)

Those session IDs, by the way, will have only node-local (even server-local) meaning, so I think it is safe to use integers for them.

@artpol84 (Contributor) commented Jan 17, 2017

So all we need to do from the PMIx API perspective is to provide a legacy info key name for this session ID value,
finish support for those sessions in the dstore, which should be very simple as all the infrastructure is already there,
and reserve a default session ID that the dstore can use in case nothing is provided explicitly.

@rhc54 (Contributor) commented Jan 17, 2017

Agreed! I can implement this in OMPI master for you so we can try it out - will try to have it for you in the next day.

@artpol84 (Contributor)

Great, thank you!

@karasevb (Member)

The cause of the growing number of dstore segment files is incorrect use of sessions within PMIx. orte-dvm creates a PMIx namespace for its own needs and does not release it while it is running. Each orterun initializes its own nspace and should clean up its own segment files, but that does not happen, because the PMIx implementation currently uses only one PMIx session: the orte-dvm session is shared with every nspace subsequently initialized by orterun. So the data from each orterun accumulates in one shared session, and that session (and all its segment files) is only released when orte-dvm stops.

@rhc54 (Contributor) commented May 29, 2017

@marksantcroos Can you confirm whether this is still happening? If so, I'll try to address it.

@rhc54 (Contributor) commented May 30, 2017

I checked the current OMPI master and found that the dstore space is getting cleaned up after each job - I am not finding any leftover entries in /dev/shm or in the /tmp/ompi* session directory tree. I am therefore closing this issue for now - we can reopen if/when this problem is seen again.

rhc54 closed this as completed May 30, 2017
@marksantcroos (Contributor, Author)

Hi Ralph, it's on my list to verify. I have been running with dstore disabled for a while. It's taking a bit longer as I also wanted to compare the performance.
