I wish the doc had a list of cluster configuration quirks that are not job scheduler specific, together with possible work-arounds where there are some. (Until a few days ago I was not aware that some clusters restrict TCP/IP connections between login and compute nodes.) Here is a rough list off the top of my head:
- submit_command not available on the compute nodes, e.g. For loop never ending #333. Possible work-around: For loop never ending #333 (comment) (I have never tried it myself). This is the case for all the OAR clusters I know about, i.e. the submit command is never available on the compute nodes, so in principle I could test this idea.
- TCP/IP restrictions between login and compute nodes, e.g. Dask JobQueue and TCP connections between login and compute nodes #354 and Specify a cluster scheduler listener port? #355. Possible work-around: start the main script / notebook on an interactive node, with all the additional pain and limitations this entails; see Dask JobQueue and TCP connections between login and compute nodes #354 (comment) for the one I know about.
- non-uniform network interfaces on login and compute nodes. I guess the same work-around as for the TCP/IP restrictions would work, but it is not a great one.
Please add more if you know more off the top of your head.
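To make the three quirks above easier to spot, here is a rough diagnostic sketch that could be run once on a login node and once inside a job on a compute node, then compared. It only uses the Python standard library and is not part of dask-jobqueue. The helper names, the list of submit commands, and the `SCHEDULER_HOST` / `SCHEDULER_PORT` placeholders are all assumptions for illustration and need adapting to the cluster at hand.

```python
import shutil
import socket

# Hypothetical placeholders: set these to the address of a listening
# process on the login node (e.g. a Dask scheduler) to test connectivity.
SCHEDULER_HOST = None
SCHEDULER_PORT = 8786

# Assumed list of submit commands to look for; adjust for your cluster.
SUBMIT_COMMANDS = ("sbatch", "qsub", "oarsub", "bsub", "condor_submit")


def submit_command_available(command):
    """Return True if `command` is found on the PATH of this node."""
    return shutil.which(command) is not None


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def interface_names():
    """Return the sorted names of the network interfaces on this node."""
    return sorted(name for _, name in socket.if_nameindex())


if __name__ == "__main__":
    print("node:", socket.gethostname())
    for command in SUBMIT_COMMANDS:
        print(f"{command} on PATH:", submit_command_available(command))
    print("network interfaces:", interface_names())
    if SCHEDULER_HOST is not None:
        reachable = can_connect(SCHEDULER_HOST, SCHEDULER_PORT)
        print(f"TCP to {SCHEDULER_HOST}:{SCHEDULER_PORT}:", reachable)
```

If the submit command shows up on the login node but not on the compute node, if the interface lists differ, or if the TCP check fails from the compute node, the cluster probably has the corresponding quirk from the list.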