Select singleserver port number on remote host #58
Conversation
Hi @cmd-ntrf, have you tried to set up a hub-managed service to extend the API? We've done some experiments around that because we have the same issue with "eager" port selection. I think the Spawner has to be customized to get the port info back to the Hub from the service.
Hi, thanks for this work! I tend to agree with you that opening a socket isn't very Jupyter-esque. The "Jupyter" solution would probably be to create a web API callback, but that might be a bit overkill here. I see two easy and batchspawner-esque possible solutions here: A) write our own version of B) extend the Either way, I would ask for your patience in not accepting this PR quite yet. Various Jupyter people have requested for some time that we put out a proper Batchspawner release, so I think we need to get that sorted out first before developing new functionality.
Branch updated from 61d0ab0 to 1c6834d.
Hi @rcthomas and @mbmilligan, thanks for the quick response. Following @rcthomas's suggestion, I dug into how the port configuration could be handled using a REST API, and before @mbmilligan replied I was already down the rabbit hole... so I wrote an API handler for BatchSpawner. I did not use a hub-managed service. I created an APIHandler that waits for a POST of the notebook port number and added it to the JupyterHub handlers list. To my surprise, it is actually simpler than my first socket solution. It also has the advantage of being somewhat secure, as the POST can only succeed if the user is authenticated. I understand the need to freeze the code for a release before adding a new feature. On my side, I will deploy the API handler solution and see how it goes. I will keep this thread updated if I face any issues.
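The flow described above can be sketched with the standard library alone: the Hub exposes an HTTP endpoint, and the remote singleserver POSTs the port it actually bound back to it. In the PR this is a JupyterHub `APIHandler` guarded by authentication; the names below are illustrative stand-ins, not the PR's actual code.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

received = {}  # stands in for user.spawner.current_port on the Hub

class PortHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON body and record the port the singleserver chose
        body = self.rfile.read(int(self.headers['Content-Length']))
        received['port'] = int(json.loads(body).get('port', 0))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence default request logging

# Hub side: listen on a local port
server = HTTPServer(('127.0.0.1', 0), PortHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Singleserver side: report the port it bound (8888 as a stand-in value)
url = 'http://127.0.0.1:%d/' % server.server_port
req = Request(url, data=json.dumps({'port': 8888}).encode(), method='POST')
urlopen(req).close()
server.shutdown()

print(received['port'])  # -> 8888
```

The real handler additionally relies on JupyterHub's authentication layer, so only the authenticated user can report a port for their own spawner.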
After some good conversations at the PEARC conference, I think the consensus is that we should go ahead and integrate this in Batchspawner for now, and in parallel pursue getting an API added to JupyterHub core. I think it's also about time to put together another release, so let's do that and tag this PR as one that we want to get into good shape for it.
Good! I have updated the PR last week to integrate the most recent changes made to batchspawner. However, the tests are still failing. I am willing to help fix them, but I will need some guidance.
I have updated this PR to allow the user to set the port value instead of forcing it to be random. This should also work with the port range PR. I have also updated the tests to fix the spawner port value. Regarding the tests:
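For context, letting the user pin the port rather than taking a random one typically looks like the following in JupyterHub configuration (`c.Spawner.port` is JupyterHub's standard Spawner trait, where the default `0` means "pick randomly"; the value shown is illustrative):

```python
# jupyterhub_config.py (sketch): pin the singleserver port
# instead of letting the Hub pick a random one.
c.Spawner.port = 8888
```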
Avoid error when notebook is not installed with JupyterHub
Now that we have gotten the latest round of testing issues resolved, I can merge this into master. Please note that before the next release we need some documentation added to the README or elsewhere. The fact that users need to install a different
```python
user = self.get_current_user()
data = self.get_json_body()
port = int(data.get('port', 0))
user.spawner.current_port = port
```
When using wrapspawner, this fails because `user.spawner.current_port` needs to be proxied to `user.spawner.child_spawner.current_port`. I made a quick fix to wrapspawner to proxy it there, but we can ask: what's the best place to do this? When I made wrapspawner proxy all attributes with getattr, it failed in some other way which I haven't understood yet.

The main question is: should wrapspawner or batchspawner be responsible for this? I would think wrapspawner, but then how do we avoid having to special-case everything needed? I'll return to this later.
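The quick fix mentioned above can be sketched as a property that forwards just `current_port` to the wrapped spawner, instead of forwarding every attribute via getattr. Class and attribute names below are illustrative, not wrapspawner's actual code.

```python
class ChildSpawner:
    """Stand-in for the real spawner that learns the port remotely."""
    current_port = 0

class WrapSpawnerSketch:
    """Illustrative wrapper that proxies current_port to its child."""
    def __init__(self):
        self.child_spawner = ChildSpawner()

    @property
    def current_port(self):
        # Reads come from the wrapped spawner, which is the one
        # that actually receives the port from the remote host.
        return self.child_spawner.current_port

    @current_port.setter
    def current_port(self, value):
        # Writes (e.g. the Hub-side handler doing
        # user.spawner.current_port = port) are forwarded too.
        self.child_spawner.current_port = value

spawner = WrapSpawnerSketch()
spawner.current_port = 54321  # what the API handler would do
print(spawner.child_spawner.current_port)  # -> 54321
```

Special-casing one attribute like this is narrow but predictable, which may explain why the blanket getattr approach failed in harder-to-diagnose ways.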
We have run into the same issue you encountered. I am no expert at this, but I know we are running the latest JupyterHub version. Are your fixes incorporated in the latest version, or do we need to pull the changes separately and merge them into it? Could you please take a minute to help us with some pointers?
Context
We provide compute nodes on our GPU cluster for a graduate deep learning course through JupyterHub and BatchSpawner. The nodes are available two hours a week under a reservation for the course (a lab period). Each node has 8 GPUs, each student is allocated one GPU, and there are around 30 students running Notebook at the same time. Therefore, multiple jupyterhub-singleserver processes can run on a single compute node. Until last week, it was working flawlessly.
Problem
During last week's lab period, two users reported being unable to connect to their notebooks. After inspecting their notebook logs, I found this message:

The users could not access their notebooks because the singleserver could not start. The singleserver could not start because it was assigned a port number that was already in use on the compute node, probably by another singleserver belonging to another student.
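For reference, the Hub-side helper that picks the port, `jupyterhub.utils.random_port`, is essentially the following (a sketch from memory of `jupyterhub/utils.py`; exact code may differ between JupyterHub versions):

```python
import socket

def random_port():
    """Get a single random port by binding an ephemeral socket, then freeing it."""
    sock = socket.socket()
    sock.bind(('', 0))            # kernel assigns any free ephemeral port
    port = sock.getsockname()[1]
    sock.close()                  # nothing is bound to the port anymore
    return port

# The port is only known to be free locally, and only at this instant.
print(0 < random_port() < 65536)  # -> True
```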
The port generation for the singleserver is done on the BatchSpawner side (batchspawner.py:272-278). The function used to generate the random port is `jupyterhub.utils.random_port`. It creates a socket locally, on the Hub/Spawner side, retrieves the port number, and closes the socket. Once the function has closed the socket, the port number is available again, since nothing on the Hub side is bound to it, and `random_port` could return the same port number when called again. The randomness of the function depends on the kernel's handling of ephemeral port numbers. Furthermore, the function only applies to local ports: there is no guarantee that an ephemeral port available on the Hub will be available on the compute node, and this is the main issue with using this function to set the remote singleserver port.

Solution
Our team brainstormed possible solutions aimed at limiting the risk of port number collisions: hashing the job id, widening the range from ephemeral ports to all user-available ports, etc., but they all shared the same problem: they meant deciding the port number on the Hub side, thus having no guarantee that port would be available on the compute node, and risking a job failure. We concluded that the singleserver port has to be selected remotely and sent back to the Hub/Spawner.
This PR fixes the port generation issue by letting the port number be generated by the singleserver and sent back to the BatchSpawner through a BSD socket. The BSD socket address and port are provided to the singleserver through command-line arguments in the job script. To add the command-line arguments and the port syncing, this PR implements a batchspawner-singleserver script and app that inherits from SingleUserNotebookApp.
The port number is received by the spawner on the created socket before `BatchSpawner.start` returns. This solution has proven effective at eliminating our port number collisions.
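The mechanism described above can be sketched with plain sockets: the Spawner listens on a throwaway socket before submitting the job, the job script passes that address to the singleserver on the command line, and the singleserver reports back the port it actually bound. Names and the thread-based "remote" side below are illustrative, not the PR's code.

```python
import socket
import threading

# Hub/Spawner side: open a listening socket before submitting the job.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))   # the real PR uses the Hub's address
listener.listen(1)
hub_host, hub_port = listener.getsockname()

def remote_singleserver(host, port, chosen_port):
    # Compute-node side: after binding its own port locally,
    # report that port back to the waiting Spawner.
    with socket.create_connection((host, port)) as s:
        s.sendall(str(chosen_port).encode())

t = threading.Thread(target=remote_singleserver,
                     args=(hub_host, hub_port, 8899))
t.start()

# Back on the Spawner: block (as in start()) until the port arrives.
conn, _ = listener.accept()
data = b''
while True:
    chunk = conn.recv(16)
    if not chunk:
        break
    data += chunk
port = int(data.decode())
conn.close()
listener.close()
t.join()
print(port)  # -> 8899
```

Because the Spawner blocks on `accept()`, the port is guaranteed to be known before `start` returns, which is exactly what the Hub's proxy routing needs.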
Issue with PR solution
Using a socket to communicate between the Spawner/Hub and the Notebook singleserver is not very Jupyter-like. It works for our use case, but could be problematic in environments where there is a firewall between the compute nodes and the Hub, since a random port number is used to communicate between the compute node and the Spawner.
There is also no validation that the data received by the Spawner is truly a port number sent by the right compute node.
Ideally, I think the selected port should be communicated back to the Hub through the REST API, but I am uncertain what that entails and how to properly implement it. Therefore, I think this PR should be accepted as is, and treated as the beginning of a solution to the aforementioned problems. I am willing to implement the right solution once we have converged on the proper way to do it.