pmix dstore esh bus error in orted on cray in dvm mode #2737
Comments
@marksantcroos Try configuring with …
Thanks, will try. That actually sounds promising, as I get the same error immediately if I restart the DVM!
Check your /tmp area - you may have to …
Yes, I noticed that as well. But I already run out of them within a single run.
With … you gain a little in memory footprint, but that's all - however, it should have been faster, not slower. We'll have to investigate.
@marksantcroos Can you tell if it slows down as the number of jobs grows? I'm wondering if it is faster at first, but then slows down - this could potentially be a consequence of the cleanup problem.
@marksantcroos @rhc54 I previously noted such an odd thing; the root cause was … The attached program can be used to evidence this behavior:
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#ifndef ROOT
#define ROOT "/mnt"
#endif
#define SZE 4096
char buffer[SZE];
int main (int argc, char *argv[]) {
char * filename = ROOT "/test";
int fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0600);
void *p;
if (0 > fd) {
perror("open ");
exit(1);
}
if (0 > ftruncate(fd, SZE)) {
perror("ftruncate ");
exit(1);
}
#if 0
if (0 > write(fd, buffer, SZE)) {
perror("write ");
exit(1);
}
#endif
if (MAP_FAILED == (p = mmap(NULL, SZE,
PROT_READ | PROT_WRITE, MAP_SHARED,
fd, 0))) {
perror("mmap ");
exit(1);
}
printf("mmap'ed %d bytes at %p\n", SZE, p);
memset(p, 0, SZE);
printf("memset'ed %d mmap'ed bytes at %p\n", SZE, p);
munmap(p, SZE);
close(fd);
unlink(filename);
return 0;
}
then … this causes a crash. A way (a workaround?) to detect this is to try to …
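For what it's worth, the #if 0 block in the program hints at one possible way to detect the condition: mmap() of a sparse file on a full tmpfs succeeds and the failure only shows up as a SIGBUS when the page is first touched, whereas an explicit write() of the data fails immediately with ENOSPC. Below is a minimal sketch of such a probe - an illustration of that idea, not the exact workaround from the original comment:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to actually back 'len' bytes of the freshly created file with written
 * data before mmap'ing it.  On a full tmpfs this fails here with ENOSPC,
 * instead of a SIGBUS later when the mmap'ed page is touched. */
static int probe_backing(int fd, size_t len)
{
    char page[4096];
    size_t done = 0;
    memset(page, 0, sizeof(page));
    while (done < len) {
        size_t chunk = (len - done) < sizeof(page) ? (len - done) : sizeof(page);
        ssize_t rc = write(fd, page, chunk);
        if (0 > rc) {
            if (ENOSPC == errno) {
                fprintf(stderr, "filesystem full - touching the mmap'ed region would SIGBUS\n");
            } else {
                perror("write probe");
            }
            return -1;
        }
        done += (size_t)rc;
    }
    return 0;
}

In the test program above, this would be called right after the open()/ftruncate() and before the mmap().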
Now that it works again I will go back to data gathering :-) Will report back.
Ok, we will check this out.
@marksantcroos, am I understanding you correctly that the reason for the repeated failures was the dstore-related leftovers in the /tmp directory?
@artpol84 Not exactly, let me explain my use case. I make use of a persistent DVM (that I start with …
@marksantcroos, thank you - it does make sense. We haven't checked this use case, but we will do that now.
@marksantcroos when you are running your test, can you monitor with plain …? FWIW, I already plugged numerous memory leaks in https://github.com/ggouaillardet/ompi/tree/topic/finalize_leaks
@artpol84 Sure, it's rather straightforward; the minimal use case is as follows: …
In general I'm happy to make people aware of our usage mode so that it gets considered earlier ;-)
@ggouaillardet dstore was re-designed to support job termination cleanup, so this case is supposed to be handled correctly. However, there might be an implementation issue. We are checking.
@ggouaillardet When I monitor …
@marksantcroos this is good input, thank you. @karasevb please reproduce and work to resolve. @marksantcroos can you also tell how you measure that without dstore it is "much much" faster? Is this a visual observation or do you somehow measure the performance? I guess (as @rhc54 already mentioned) if memory gets filled up you may see a slowdown because of that. But still I'd like to know how you evaluate performance so that we can reproduce it. Once fixed, will it be possible to re-evaluate in your environment?
@marksantcroos also what is your codebase? master or v2.x?
@artpol84 This is with the latest master. Performance was perception-based; I will actually measure to see whether that was correct.
@marksantcroos thank you.
Both …
@marksantcroos Can you please confirm your environment for us? Was this done on Titan?
@rhc54 Yes, this was on Titan.
@marksantcroos can you explain a bit more about your plot?
@artpol84 I started the DVM once for each configuration, and then repeatedly ran my program that fires 100 tasks to the DVM (repetitions on the X-axis). So I measure how long it takes to execute these 100 /bin/date tasks, and that's what you see on the Y-axis.
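As a rough sketch of that kind of measurement - timing a batch of trivial tasks pushed through an already-running DVM - something like the following could be used. The orte-submit command line is only a placeholder (options differ across OMPI versions and sites), the tasks are launched sequentially for simplicity, and this is not the actual harness behind the plot:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTASKS 100

int main(void)
{
    struct timespec t0, t1;
    double secs;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NTASKS; i++) {
        /* placeholder command line - point it at the running DVM as appropriate */
        if (0 != system("orte-submit -n 1 /bin/date > /dev/null")) {
            fprintf(stderr, "task %d failed\n", i);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d tasks in %.3f s (%.3f s/task)\n", NTASKS, secs, secs / NTASKS);
    return 0;
}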
@artpol84 We discussed this a bit on the call this morning, so let me capture those thoughts here. The workload in this use case is very different from the ones we normally encounter. Instead of having one large job with lots of procs/node, this workload has many small jobs that consist of only a few (often just one) processes. In this use case, there really isn't going to be any benefit from dstore because there aren't multiple procs/node sharing the information. So we see the overhead of creating all these shared memory segments, but get no benefit from them. I would therefore not worry about the performance difference here. We just need to document that dstore should be disabled for this type of workload, and make it possible to do so via an MCA param at runtime instead of during configure (just to make life easier for users). We'll work on that over in the PMIx side of the house. I think the one thing that does, however, need addressing here is the cleanup problem, as that can/will impact long-running RMs.
@marksantcroos I see, thank you. Out of curiosity - what if you vary the number of /bin/true runs, i.e. from 2 to 128? @rhc54 thanks, this makes sense.
Sounds reasonable to me - I see no problem providing an integer or string, so we can pick whatever you like. We just need to add it to the list of RM-required data, and then protect ourselves by defaulting to assuming a single session if it isn't provided.
Well, I think it's OK to default to a single session, as some RMs will support only that - SLURM, for example. But if some particular RM knows that there will be multiple independent jobs, it is the RM's responsibility to manage session IDs and provide them to …
Those session IDs, by the way, will have only node-local (even server-local) meaning, so I think it is safe to use integers for that.
So all we need to do from the PMIx API perspective is to provide a legacy info key name for this session ID value.
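To illustrate the shape of that, here is a hedged sketch of how a host RM might hand the node-local session ID to the PMIx server when registering a namespace. The key name "hypothetical.session.id" is purely a placeholder invented for this example (the real reserved key was still to be defined at this point), and error handling plus the registration callback are omitted:

#include <pmix_server.h>

static void register_nspace_with_session(const char *nspace, int nlocalprocs,
                                         uint32_t session_id)
{
    pmix_info_t info;
    PMIX_INFO_CONSTRUCT(&info);
    /* node-local integer session ID, as discussed above; the key string is a
     * placeholder, not an actual PMIx-defined key */
    PMIX_INFO_LOAD(&info, "hypothetical.session.id", &session_id, PMIX_UINT32);
    PMIx_server_register_nspace(nspace, nlocalprocs, &info, 1, NULL, NULL);
    PMIX_INFO_DESTRUCT(&info);
}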
Agreed! I can implement this in OMPI master for you so we can try it out - will try to have it for you in the next day.
Great, thank you!
The cause of increasing the …
@marksantcroos Can you confirm if this is still happening? If so, I'll try to address it.
I checked the current OMPI master and found that the dstore space is getting cleaned up after each job - I am not finding any leftover entries in /dev/shm or in the /tmp/ompi* session directory tree. I am therefore closing this issue for now - we can reopen if/when this problem is seen again.
Hi Ralph, it's on my list to verify. I have been running with dstore disabled for a while. It took a bit longer as I also wanted to compare the performance.
With the latest master, in dvm mode, after running around a couple of thousand tasks I repeatedly run into the following:
Will dig further, but increasing the set of eyes looking at it.