From d2304a918132ada6320503d69c7a2327e3d0cfcc Mon Sep 17 00:00:00 2001
From: vsoch
Date: Mon, 6 Feb 2023 12:24:35 -0700
Subject: [PATCH] add tutorial to connect to flux between clusters

this uses a proxy jump in the ssh config.

Signed-off-by: vsoch
---
 tutorials/commands/index.rst               |  18 +++
 tutorials/commands/ssh-across-clusters.rst | 157 +++++++++++++++++++++
 tutorials/index.rst                        |   1 +
 tutorials/lab/coral.rst                    |   7 +-
 4 files changed, 181 insertions(+), 2 deletions(-)
 create mode 100644 tutorials/commands/index.rst
 create mode 100644 tutorials/commands/ssh-across-clusters.rst

diff --git a/tutorials/commands/index.rst b/tutorials/commands/index.rst
new file mode 100644
index 00000000..5c1e58b7
--- /dev/null
+++ b/tutorials/commands/index.rst
@@ -0,0 +1,18 @@
+.. _command-tutorials:
+
+Command Tutorials
+=================
+
+Welcome to the Command Tutorials! These tutorials should help you map specific Flux commands
+to your use case, and then see detailed usage.
+
+ - ``flux proxy`` (:ref:`ssh-across-clusters`): send commands to a Flux instance on another cluster using ssh
+
+This section is currently 🚧️ under construction 🚧️, so please come back later to see more command tutorials!
+
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Command Tutorials
+
+   ssh-across-clusters
\ No newline at end of file
diff --git a/tutorials/commands/ssh-across-clusters.rst b/tutorials/commands/ssh-across-clusters.rst
new file mode 100644
index 00000000..06a47dde
--- /dev/null
+++ b/tutorials/commands/ssh-across-clusters.rst
@@ -0,0 +1,157 @@
+.. _ssh-across-clusters:
+
+===================
+SSH across clusters
+===================
+
+Let's say you want to create a Flux instance in an allocation on one cluster (let's say our first cluster is "noodle") 🍜️
+and then connect to it via ssh from another cluster (let's say our second cluster is called "quartz"). This is possible
+with the right setup of your ``~/.ssh/config``.
+
+----------------------
+Create a Flux Instance
+----------------------
+
+First, let's create the allocation on the first cluster. We typically want to ask for an allocation
+and run ``flux start`` via our job manager. Here we might be on a login node:
+
+.. code-block:: sh
+
+   # slurm specific
+   $ salloc -N4 --exclusive
+   $ srun -N4 -n4 --pty --mpibind=off flux start
+
+And then we get our allocation! As a sanity check, once you are on one of the nodes, you should be
+able to submit a job and see the output:
+
+.. code-block:: sh
+
+   noodle:~$ flux mini run hostname
+   noodle220
+   noodle221
+   noodle222
+   noodle223
+
+You might need to adapt the commands above for your resource manager; e.g., Slurm uses ``srun``.
+After you run ``flux start``, you are inside a Flux instance! We generally want to launch this
+instance as a parallel job under the native resource manager, rather than get an allocation and
+just run ``flux start``, because a test instance started that way ignores the native resource
+allocation and starts N brokers in place. You can sanity check the resources you have within the
+instance by running:
+
+.. code-block:: sh
+
+   noodle:~$ flux resource list
+        STATE NNODES NCORES NGPUS NODELIST
+         free      4    160     0 noodle[220,221,222,223]
+    allocated      0      0     0
+         down      0      0     0
+
+And you can echo ``$FLUX_URI`` to see the path of the socket that you will also need later:
+
+.. code-block:: sh
+
+   noodle:~$ echo $FLUX_URI
+   local:///var/tmp/flux-MLmxy2/local-0
+
+We have now defined a goal for success: getting this same resource listing by running a command
+from a node on a different cluster.
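+
+As an aside, the ssh URI we will use later is just this local URI with the ``local://`` scheme
+swapped for ``ssh://<hostname>``. Here is an optional sketch that constructs it with plain shell
+(it assumes ``$FLUX_URI`` uses the ``local://`` scheme and that ``hostname`` prints a name the
+other cluster can reach over ssh):
+
+.. code-block:: sh
+
+   # swap the local:// scheme for ssh://<this node's hostname>
+   # note: assumes $FLUX_URI starts with "local://"
+   noodle:~$ echo "ssh://$(hostname)${FLUX_URI#local://}"
+   ssh://noodle220/var/tmp/flux-MLmxy2/local-0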
+
+-----------------------
+Connect to the Instance
+-----------------------
+
+Next, let's ssh into the second cluster. Take the hostname where your instance is running,
+and create a `proxy jump `_ in your ``~/.ssh/config``:
+
+.. code-block:: ssh
+
+   Host noodle
+     HostName noodle
+
+   Host noodle220
+     HostName noodle220
+     ProxyJump noodle
+
+.. note::
+
+   This ``~/.ssh/config`` needs to be written on the cluster that you are going to connect from.
+   In many cases, a shared filesystem maps your home directory across clusters, so you may see
+   the same file in multiple places.
+
+You'll first need to tell Flux to use ssh for the proxy command:
+
+.. code-block:: sh
+
+   quartz:~$ export FLUX_SSH=ssh
+
+Next, from this same location, try using ``flux proxy`` to connect to your Flux instance! Target the URI
+that you found before, ``local:///var/tmp/flux-MLmxy2/local-0``, replacing the ``local://`` scheme with
+``ssh://`` followed by the hostname ``noodle220``:
+
+.. code-block:: sh
+
+   quartz:~$ flux proxy ssh://noodle220/var/tmp/flux-MLmxy2/local-0
+
+If you have trouble, use the force!
+
+.. code-block:: sh
+
+   quartz:~$ flux proxy --force ssh://noodle220/var/tmp/flux-MLmxy2/local-0
+
+You should then be able to run the same resource listing:
+
+.. code-block:: sh
+
+   quartz:~$ flux resource list
+        STATE NNODES NCORES NGPUS NODELIST
+         free      4    160     0 noodle[220,221,222,223]
+    allocated      0      0     0
+         down      0      0     0
+
+Next, try submitting a job! You should be able to see that you are running on the first cluster,
+but submitting from the second:
+
+.. code-block:: sh
+
+   quartz:~$ flux mini run hostname
+   noodle220
+
+If you are still connected to the first cluster, you should also be able to query the jobs there.
+E.g., here we submit a sleep from the second, connected cluster:
+
+.. code-block:: sh
+
+   quartz:~$ flux mini submit sleep 60
+   f22hdyb35
+
+And then see it from a node on either cluster!
+
+.. code-block:: sh
+
+   $ flux job list | jq
+   {
+     "id": 2272725565440,
+     "userid": 34633,
+     "urgency": 16,
+     "priority": 16,
+     "t_submit": 1675713045.009863,
+     "state": 16,
+     "name": "sleep",
+     "ntasks": 1,
+     "nnodes": 1,
+     "ranks": "2",
+     "nodelist": "noodle220",
+     "expiration": 1676317845,
+     "t_depend": 1675713045.009863,
+     "t_run": 1675713045.0290241,
+     "annotations": {
+       "sched": {
+         "queue": "default"
+       }
+     }
+   }
+
+And that's it! With this strategy, it should be easy to interact with Flux instances across any
+two resources where ssh is supported. If you have any questions, please `let us know `_.
diff --git a/tutorials/index.rst b/tutorials/index.rst
index cbc950dd..1fb43e5a 100644
--- a/tutorials/index.rst
+++ b/tutorials/index.rst
@@ -13,3 +13,4 @@ find a tutorial of interest.
 
    lab/index
    integrations/index
+   commands/index
\ No newline at end of file
diff --git a/tutorials/lab/coral.rst b/tutorials/lab/coral.rst
index ab5513d4..3c217e0b 100644
--- a/tutorials/lab/coral.rst
+++ b/tutorials/lab/coral.rst
@@ -24,9 +24,12 @@ If you are using the ORNL system Summit, run:
 
    module use /sw/summit/modulefiles/ums/gen007flux/linux-rhel8-ppc64le/Core
 
-------------------
+
+.. _launch-flux-on-lassen:
+
+--------------
 Launching Flux
-------------------
+--------------
 
 You can load the latest Flux-team managed installation on LLNL and ORNL CORAL machines using: