[SPARK-2313] PySpark pass port rather than stdin #3424

Closed

@lvsoft lvsoft commented Nov 24, 2014

This patch will fix [SPARK-2313].

It peeks for an available free port number and passes that port number to Py4j.Gateway for binding via a command-line argument.
The scan for a free port starts at a value derived from the PID with a modulo operation, which could avoid potential concurrency issues, such as when supporting multiple PySpark instances in the future. The port number that Py4J prints via STDIN is also parsed as a double check.
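
For illustration, the PID-seeded scan described above might look like the following minimal sketch. It is not the code from this patch: the function name `find_free_port` and the `base`, `span`, and `attempts` values are hypothetical, chosen only to make the example runnable.

```python
import os
import socket


def find_free_port(base=20000, span=10000, attempts=100):
    """Probe for a bindable local port, starting at an offset taken from the PID.

    base, span and attempts are illustrative assumptions, not values from the patch.
    """
    offset = os.getpid() % span  # different processes tend to start at different ports
    for i in range(attempts):
        port = base + (offset + i) % span
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind(("127.0.0.1", port))
            except OSError:
                continue  # port is in use, try the next candidate
            # The probe socket is closed on return, so the port is only
            # known to have been free at the moment of the check.
            return port
    raise RuntimeError("no free port found in the scanned range")
```

Because the probe socket is closed before the port is handed to another process, a different program can still claim the port in between; the scan narrows that window but does not close it.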

@AmplabJenkins

Can one of the admins verify this patch?

@davies
Contributor

davies commented Nov 24, 2014

I think the motivation of SPARK-2313 is to remove the dependency on STDIN for returning the port back to Python. Just replacing it with a socket may work (a domain socket may not work on Windows?). There is a race condition here: the peeked free port could be taken by another program before the gateway binds to it.

So, the approach will be:

  1. Bind a socket to a random port in Python
  2. Pass that port into the JVM, which connects to it
  3. The Java Gateway binds to a random port
  4. The JVM passes the gateway port back via the socket (created in 1)
  5. Python reads the port from the socket (created in 1), then closes it
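
The five steps above can be sketched in Python as follows. This is a hedged illustration only: a thread stands in for the JVM, and the names `fake_gateway` and `launch_and_read_port`, as well as the 4-byte big-endian wire format, are assumptions rather than anything specified in this thread.

```python
import socket
import struct
import threading


def fake_gateway(callback_port):
    """Stand-in for the JVM side (steps 2 to 4). Hypothetical helper."""
    # Step 3: the "gateway" binds to a port chosen by the OS.
    gateway = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    gateway.bind(("127.0.0.1", 0))
    gateway.listen(1)
    gateway_port = gateway.getsockname()[1]
    # Steps 2 and 4: connect to the Python-side socket and send the port
    # as a 4-byte big-endian integer (an assumed wire format).
    with socket.create_connection(("127.0.0.1", callback_port)) as conn:
        conn.sendall(struct.pack("!i", gateway_port))
    return gateway


def launch_and_read_port():
    # Step 1: bind a listening socket to a random port in Python.
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)
    callback.settimeout(5)
    callback_port = callback.getsockname()[1]

    result = {}
    worker = threading.Thread(
        target=lambda: result.update(gateway=fake_gateway(callback_port))
    )
    worker.start()

    # Step 5: accept the connection, read exactly 4 bytes, then close.
    conn, _ = callback.accept()
    with conn:
        data = b""
        while len(data) < 4:
            chunk = conn.recv(4 - len(data))
            if not chunk:
                raise ConnectionError("peer closed before sending the port")
            data += chunk
    callback.close()
    worker.join()
    return struct.unpack("!i", data)[0], result["gateway"]
```

Since the gateway binds with port `0` and reports whatever the OS assigned, no port is ever "peeked" and released, which is what removes the race mentioned above.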

@lvsoft
Author

lvsoft commented Nov 25, 2014

I think this is a better solution.
However, passing the port back via a socket will affect Py4J too.
Currently, stdin is the only method Py4J supports for passing back the port number.

asfgit pushed a commit that referenced this pull request Feb 16, 2015
…hon driver

This patch changes PySpark so that the GatewayServer's port is communicated back to the Python process that launches it over a local socket instead of a pipe. The old pipe-based approach was brittle and could fail if `spark-submit` printed unexpected output to stdout.

To accomplish this, I wrote a custom `PythonGatewayServer.main()` function to use in place of Py4J's `GatewayServer.main()`.

Closes #3424.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits:

6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed
0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies.
9bdb4b6 [Josh Rosen] Handle case where getListeningPort returns -1
3fb7ed1 [Josh Rosen] Remove stdout=PIPE
2458934 [Josh Rosen] Use underscore to mark env var. as private
d12c95d [Josh Rosen] Use Logging and Utils.tryOrExit()
e5f9730 [Josh Rosen] Wrap everything in a giant try-block
2f70689 [Josh Rosen] Use stdin PIPE to share fate with driver
8bf956e [Josh Rosen] Initial cut at passing Py4J gateway port back to driver via socket

(cherry picked from commit 0cfda84)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
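
As a rough illustration of the launcher side described in this commit message (private environment variable, stdin pipe to share fate, `select()` so the wait does not block if the child dies), here is a minimal sketch. It is not the merged PySpark code: a small Python child process stands in for `PythonGatewayServer`, and the names `_DEMO_CALLBACK_PORT`, `CHILD`, and `wait_for_port`, plus the placeholder port value, are all hypothetical.

```python
import os
import select
import socket
import struct
import subprocess
import sys

# Hypothetical stand-in for the gateway process: it reads the callback port
# from a private (underscore-prefixed) env var, reports a placeholder port,
# then blocks on stdin so that it exits when the parent closes the pipe.
CHILD = r"""
import os, socket, struct, sys
port = int(os.environ["_DEMO_CALLBACK_PORT"])
with socket.create_connection(("127.0.0.1", port)) as s:
    s.sendall(struct.pack("!i", 25333))  # placeholder gateway port
sys.stdin.read()
"""


def wait_for_port(child_source=CHILD, poll_seconds=0.2):
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)
    env = dict(os.environ)
    env["_DEMO_CALLBACK_PORT"] = str(callback.getsockname()[1])
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source], stdin=subprocess.PIPE, env=env
    )
    port = None
    try:
        # Poll with a timeout so a dead child ends the loop instead of hanging.
        while port is None and proc.poll() is None:
            readable, _, _ = select.select([callback], [], [], poll_seconds)
            if callback in readable:
                conn, _ = callback.accept()
                with conn, conn.makefile("rb") as stream:
                    data = stream.read(4)  # reads until 4 bytes or EOF
                if len(data) == 4:
                    port = struct.unpack("!i", data)[0]
    finally:
        callback.close()
        proc.stdin.close()  # closing stdin lets the child exit
        proc.wait()
    if port is None:
        raise RuntimeError("child exited before reporting its port")
    return port
```

Calling `wait_for_port()` returns the placeholder value sent by the child, while passing a `child_source` that exits immediately raises `RuntimeError` after at most one polling interval rather than blocking on `accept()`.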
@asfgit asfgit closed this in 0cfda84 Feb 16, 2015