Description
Spawning many kernels in a short period of time may result in a ZMQError because one of the kernels tries to use a port already in use by another kernel.
This is due to the current implementation of jupyter_client: after free ports have been found, they are written to a connection file that is passed to the kernel the client then starts.
The problem is that we might search for free ports again after creating the connection file but before starting the kernel (when restoring a session in Jupyter Lab, or spawning multiple kernels quickly in Voilà, for instance). Since the first kernel has not started yet, its ports are still free, and jupyter_client might write a connection file for the next kernel with the same ports as in the first connection file. Two kernels will therefore attempt to use the same ports.
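
The window between "port found" and "port bound" can be reproduced with plain sockets. The following is only a minimal sketch, not the jupyter_client code: `find_free_port` and `demonstrate_race` are hypothetical helper names, and a second socket plays the role of whichever process (another kernel or an unrelated application) grabs the port first.

```python
import socket


def find_free_port(ip="127.0.0.1"):
    """Ask the OS for a free port, then release it immediately.

    Hypothetical helper: the port is only known to be free at the
    moment of the call, nothing reserves it afterwards.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((ip, 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def demonstrate_race(ip="127.0.0.1"):
    """Return True if the 'kernel' fails to bind the port it was given."""
    # The client finds a port and would write it to a connection file.
    port = find_free_port(ip)

    # Before the kernel starts, something else binds the same port.
    squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    squatter.bind((ip, port))
    try:
        # The kernel now starts and tries to bind the port from the file.
        kernel_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            kernel_sock.bind((ip, port))
            return False
        except OSError:
            # Address already in use: the failure described in this issue.
            return True
        finally:
            kernel_sock.close()
    finally:
        squatter.close()


if __name__ == "__main__":
    print("kernel bind failed:", demonstrate_race())
```
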
Even if we can fix this issue in Jupyter Lab and Voilà (by searching free ports for all kernels first, and then writing all the connection files at once), this does not prevent other applications (unrelated to the Jupyter project) from starting and using a port written in the connection file before the kernel has started.
A solution would be to always let the kernel find free ports and communicate them to the client (a kind of handshake pattern):
- The client opens a socket A, passes the port of this socket to the kernel that it launches, and waits.
- The kernel starts and finds free ports to bind its shell, control, stdin, heartbeat and iopub sockets. Then it connects to socket A of the client, sends a message containing these ports, and closes the connection to socket A.
- Upon reception of this message, the client connects to the kernel and closes socket A.
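
The steps above could look roughly like this. It is only a sketch under several assumptions: it uses plain TCP sockets and a JSON message instead of ZeroMQ, a thread stands in for the kernel process, and `kernel_main`, `client_launch` and `CHANNELS` are hypothetical names that do not exist in jupyter_client or in any kernel.

```python
import json
import socket
import threading

CHANNELS = ("shell", "control", "stdin", "heartbeat", "iopub")


def kernel_main(ip, handshake_port):
    """Hypothetical kernel side: bind first, then report the ports."""
    bound = {}
    for name in CHANNELS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((ip, 0))  # the kernel picks and holds its own free port
        s.listen()
        bound[name] = s
    ports = {name: s.getsockname()[1] for name, s in bound.items()}

    # Connect to the client's socket A, send the ports, close the connection.
    with socket.create_connection((ip, handshake_port)) as conn:
        conn.sendall(json.dumps(ports).encode("utf-8"))
    return bound


def client_launch(ip="127.0.0.1"):
    """Hypothetical client side: open socket A, launch the kernel, wait."""
    kernel_sockets = {}

    def run_kernel(port):
        kernel_sockets.update(kernel_main(ip, port))

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock_a:
        sock_a.bind((ip, 0))
        sock_a.listen(1)
        handshake_port = sock_a.getsockname()[1]

        # A thread stands in for the kernel process in this sketch.
        kernel = threading.Thread(target=run_kernel, args=(handshake_port,))
        kernel.start()

        # Wait for the kernel to report its ports; read until it disconnects.
        conn, _ = sock_a.accept()
        with conn:
            data = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
        kernel.join()
    # Socket A is closed here; a real client would now connect to the ports.

    ports = json.loads(data.decode("utf-8"))
    for s in kernel_sockets.values():
        s.close()  # tidy up the stand-in kernel's sockets
    return ports


if __name__ == "__main__":
    print(client_launch())
```

Because the kernel binds its sockets before it reports the ports, nothing else can take them between the moment they are chosen and the moment they are used.
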
I am aware that this requires significant changes in the kernel protocol and the implementation of a lot of kernels, but I do not see a better solution to this issue.
cc @vidartf and @martinRenou who have been discussing this issue in Voila