DVM startup failed #1237

Closed
andre-merzky opened this issue Feb 20, 2017 · 14 comments

@andre-merzky
Member

This happened while testing v0.45.RC2 on ornl.titan, example 00:

    2017-02-20 11:00:09,080: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : Starting ORTE DVM on 1 nodes with '/usr/bin/stdbuf -oL /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orte-dvm' ...
    2017-02-20 11:00:09,178: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : ORTE DVM URI: 2010382336.0;tcp://10.128.0.13,160.91.205.243:39421
    2017-02-20 11:11:23,340: agent_0.bootstrap_3 : MainProcess : DVMWatcher : INFO : DVM stopped (1)

Note that titan seems unusually slow right now, so I am not sure if the machine is healthy. What I am missing, though, is any indication of DVM status and health; we need some logging there.
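To illustrate the kind of logging asked for here, the following is a minimal sketch of a watcher that forwards every line a child process prints and reports its exit code. It is not the RADICAL-Pilot implementation: the function name `start_watched`, the logger name, and the stand-in command are assumptions made for this example only.

```python
import logging
import subprocess
import sys
import threading

log = logging.getLogger("dvm_watcher")  # hypothetical logger name


def start_watched(cmd, logger=log):
    """Start `cmd`, log each line it prints, and log its exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True, bufsize=1)
    logger.info("started %s (pid %d)", cmd, proc.pid)

    def watch():
        # Forward every output line, so a slow or failing startup leaves
        # a trace instead of a silent gap in the log.
        for line in proc.stdout:
            logger.info("dvm output: %s", line.rstrip())
        retval = proc.wait()
        level = logging.INFO if retval == 0 else logging.ERROR
        logger.log(level, "process stopped (exit code %d)", retval)

    thread = threading.Thread(target=watch, name="DVMWatcher", daemon=True)
    thread.start()
    return proc, thread


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s: %(threadName)-12s: %(levelname)-7s: %(message)s")
    # Stand-in command; a real setup would pass the orte-dvm command line.
    cmd = [sys.executable, "-c", "print('ready'); raise SystemExit(1)"]
    proc, thread = start_watched(cmd)
    thread.join()
```

With something along these lines, the eleven-minute gap between the URI line and the "DVM stopped" line would at least show whatever the DVM itself printed in the meantime.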

@andre-merzky
Member Author

Excellent - thanks!

@ibethune
Contributor

Any indication of whether this is just a Titan blip, in which case the ticket is for more logging (next release), or is it repeatable? Certainly Titan was working OK during the RC1 testing.

@andre-merzky
Member Author

No idea right now, really, but I am looking into this today.

@ibethune
Contributor

@andre-merzky any update on this one?

@andre-merzky
Member Author

Sorry that it took me so long to see this, but the DVM was started twice in the ornl.titan_lib configuration: once for the ORTE LM as agent startup method, and once for the ORTE_LIB LM as unit launch method. I am pretty sure that we fixed that at some point in the past (by setting the agent startup to FORK/POPEN, which is the only correct choice anyway); either way, it sneaked back in here.

There is of course the possibility that this masked other errors, since we saw error modes other than DVM startup failures, but those should get separate tickets now.
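A double startup like the one described above could be flagged before the agent launches anything. The sketch below is only an illustration under stated assumptions: the config keys `agent_launch_method` and `task_launch_method`, and the idea that both `ORTE` and `ORTE_LIB` start their own DVM, are taken as given for the example and are not checked against the actual resource configuration format.

```python
# Launch methods assumed, for this sketch, to start their own ORTE DVM.
DVM_LAUNCH_METHODS = {"ORTE", "ORTE_LIB"}

# Assumed config key names; the real resource config may differ.
LAUNCH_METHOD_KEYS = ("agent_launch_method", "task_launch_method")


def dvm_starters(cfg):
    """Return the config keys whose launch method would start a DVM."""
    return [key for key in LAUNCH_METHOD_KEYS
            if cfg.get(key) in DVM_LAUNCH_METHODS]


def check_single_dvm(cfg):
    """Raise ValueError if more than one launch method would start a DVM."""
    starters = dvm_starters(cfg)
    if len(starters) > 1:
        raise ValueError("DVM would be started %d times: %s"
                         % (len(starters), ", ".join(starters)))
    return starters


if __name__ == "__main__":
    # Agent started via FORK, units via ORTE_LIB: a single DVM.
    print(check_single_dvm({"agent_launch_method": "FORK",
                            "task_launch_method": "ORTE_LIB"}))
```

A check of this kind does not decide which launch method should own the DVM; it only makes the conflict visible at configuration time rather than during a slow startup on the machine.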

@marksantcroos
Contributor

This is not the correct fix. See 1968501.

@marksantcroos
Contributor

Note that we didn't backport this as it was said that all users of titan_lib would be using the split_module branch anyway.

@andre-merzky
Member Author

Right, that makes sense. Well, that patch applies with only a minor conflict, so I have no strong preference on whether to backport it or not...

@ibethune modified the milestones: Future Release, 0.45 (Feb 27, 2017)
@ibethune
Contributor

So this ticket should be looked at in the context of the next release, to (a) check whether the actual dual-startup problem is resolved, and (b) consider whether additional tracing around DVM startup should be made the default.

@ibethune modified the milestones: 0.46, Future Release (Mar 2, 2017)
@andre-merzky
Member Author

#1277 is currently being tested in this context.

@andre-merzky
Member Author

We seem to have a working configuration on titan by now, including a fix to the double-startup problem. Is anybody opposed to closing this ticket?

@marksantcroos
Contributor

I would say that this issue only applied to the old situation.

@andre-merzky
Member Author

Thanks Mark.
