RP 0.45.RC1 failure on comet #1212

Closed
vivek-bala opened this issue Feb 13, 2017 · 10 comments

@vivek-bala
Contributor

Script: 00_getting_started.py
Resource: comet
Full log: https://gist.github.com/vivek-bala/0d0d4ee7662c2e8d2ed9ebb744cc295e

Interesting part of the log:

2017-02-13 17:41:59,081: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : Traceback (most recent call last):
  File "/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 787, in find
    data += self.read (timeout=_POLLDELAY)
  File "/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 679, in read
    % (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
) (/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_process.py +679 (read)  :  % (e, self.tail)))

2017-02-13 17:41:59,135: radical.saga.pty    : MainProcess                     : MainThread     : ERROR   : read from process failed '[Errno 5] Input/output error' : (ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
) ((ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
)) (/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +40 (translate_exception)  :  e = se.BadParameter (cmsg))
Traceback (most recent call last):
  File "/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 263, in _initialize_pty
    n, match = pty_shell.find (prompt_patterns, delay)
  File "/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 790, in find
    raise ptye.translate_exception (e, "(%s)" % data)
BadParameter: read from process failed '[Errno 5] Input/output error' : (ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
) ((ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
)) (/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +40 (translate_exception)  :  e = se.BadParameter (cmsg))
caught Exception: read from process failed '[Errno 5] Input/output error' : (ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
) ((ssh: Could not resolve hostname comet.sdsc.xsede.org: Name or service not known
)) (/home/vivek/Research/ves/test_rp/local/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +40 (translate_exception)  :  e = se.BadParameter (cmsg))

This error is reproducible. The sandbox on comet contains only the bootstrap_1.sh script.

@andre-merzky
Member

That is a strange one, though. Can you please try running the following on the command line:

$ host comet.sdsc.xsede.org
comet.sdsc.xsede.org is an alias for comet.sdsc.edu.
comet.sdsc.edu has address 198.202.113.252
comet.sdsc.edu has address 198.202.113.253

If you see the same, please run the test again. If you see something different, please post the output.

Thanks!
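
The same lookup can also be tried from inside the Python environment that runs the script, in case name resolution differs between the interactive shell and that process. This is only a minimal sketch: `resolve` is a hypothetical helper, not part of RP or SAGA, and port 22 is assumed because the failing command is ssh.

```python
import socket


def resolve(hostname, port=22, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses for hostname, or [] on failure.

    `resolver` is injectable only so the failure path can be exercised
    without network access; by default the system resolver is used.
    """
    try:
        infos = resolver(hostname, port)
    except socket.gaierror as exc:
        # Corresponds to ssh's "Name or service not known" case.
        print("could not resolve %s: %s" % (hostname, exc))
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("comet.sdsc.xsede.org")` from the virtualenv used for the test should list the same addresses that `host` prints. An empty list would point at the local resolver setup rather than at RP.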

@vivek-bala
Contributor Author

$ host comet.sdsc.xsede.org
comet.sdsc.xsede.org is an alias for comet.sdsc.edu.
comet.sdsc.edu has address 198.202.113.253
comet.sdsc.edu has address 198.202.113.252
comet.sdsc.edu mail is handled by 10 postal.sdsc.edu.

@vivek-bala
Contributor Author

I get the same error.

@andre-merzky
Member

From what machine are you running, Vivek? Can you try to run from the radical server, and compare results? I am not saying this should not work out of the box - but this looks more like a network issue than anything else to me?

@vivek-bala
Contributor Author

Hmmm. Sure I'll try it out from our VM.

@vivek-bala
Contributor Author

vivek-bala commented Feb 14, 2017

From the radical VM, I don't see the above error.

But, the CUs start executing and then fail with the following error:

/home/marksant/openmpi/installed/rhc/bin/orterun: Error: unknown option "--hnp"
Type '/home/marksant/openmpi/installed/rhc/bin/orterun --help' for usage.

@andre-merzky
Member

Hmm, that is an ORTE deployment error it seems, and indeed different from what you had before. Can you please open a new ticket for this one? I'll need to look into it. Thanks!

Not sure how to proceed with your original problem: your logs indicate that the ssh login to comet works well, but that the file transfer stalls and times out. Let me cobble up a couple of commands to run some tests...

@vivek-bala
Contributor Author

I opened a new ticket for the orte failure (#1218). Let me know how I can debug the original error.

@andre-merzky
Member

For the original error: can you please set RADICAL_SAGA_PTY_VERBOSE=DEBUG, create a new logfile, and attach it? Thanks!
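
One way to do that reproducibly is a small wrapper that sets the variable for a single run and captures the console output in a fresh file. This is a rough sketch under assumptions: `run_with_pty_debug` and the logfile name are hypothetical, and it only captures what the script writes to stdout and stderr; any log files RP writes on its own would need to be collected separately.

```python
import os
import subprocess
import sys


def run_with_pty_debug(script, logfile, python=sys.executable):
    """Run `script` with RADICAL_SAGA_PTY_VERBOSE=DEBUG and log its output.

    Returns the exit code of the child process.
    """
    # Copy the current environment and add the debug setting for this run only.
    env = dict(os.environ, RADICAL_SAGA_PTY_VERBOSE="DEBUG")
    with open(logfile, "w") as log:
        # Send stdout and stderr to the same file so ordering is preserved.
        return subprocess.call([python, script], env=env,
                               stdout=log, stderr=subprocess.STDOUT)
```

For example, `run_with_pty_debug("00_getting_started.py", "pty_debug.log")` would rerun the failing script and leave the output in `pty_debug.log` for attaching here.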

@vivek-bala
Contributor Author

Ok, I can't reproduce this error anymore. I'll close this for now. Will reopen if I run into this issue again.
