
add tutorial to connect to flux between clusters
this uses a proxy jump in the ssh config.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Feb 9, 2023
1 parent 9d024d9 commit d2304a9
Showing 4 changed files with 181 additions and 2 deletions.
18 changes: 18 additions & 0 deletions tutorials/commands/index.rst
@@ -0,0 +1,18 @@
.. _command-tutorials:

Command Tutorials
=================

Welcome to the Command Tutorials! These tutorials should help you map specific Flux commands
to your use case, and then see detailed usage.

- ``flux proxy`` (:ref:`ssh-across-clusters`): "Send commands to a flux instance across clusters using ssh"

This section is currently 🚧️ under construction 🚧️, so please come back later to see more command tutorials!


.. toctree::
   :maxdepth: 2
   :caption: Command Tutorials

   ssh-across-clusters
157 changes: 157 additions & 0 deletions tutorials/commands/ssh-across-clusters.rst
@@ -0,0 +1,157 @@
.. _ssh-across-clusters:

===================
SSH across clusters
===================

Let's say you want to create a Flux instance in an allocation on a cluster (e.g., let's say our first cluster is "noodle") 🍜️
and then connect to it via ssh from another cluster (let's say our second cluster is called "quartz"). This is possible with the right
setup of your ``~/.ssh/config``.

----------------------
Create a Flux Instance
----------------------

First, let's create the allocation on the first cluster. We typically want to ask for an allocation
and then run ``flux start`` via our job manager. Here we might be on a login node:

.. code-block:: sh

    # slurm specific
    $ salloc -N4 --exclusive
    $ srun -N4 -n4 --pty --mpibind=off flux start

And then we get our allocation!
As a sanity check, once you are on one of the nodes, you should be able to submit a job and see the output:

.. code-block:: sh

    noodle:~$ flux mini run hostname
    noodle220
    noodle221
    noodle222
    noodle223

And you might adapt this command to be more specific to your resource manager; e.g., Slurm uses ``srun``.

After you run ``flux start``, you are inside a Flux instance! We generally want to launch
this instance as a parallel job under the native resource manager, rather than get an allocation and just run ``flux start``.
The reason is that a test instance started that way ignores the native resource allocation and starts N brokers in place.
You can sanity check the resources you have within the instance by then running:

.. code-block:: sh

    noodle:~$ flux resource list
         STATE NNODES NCORES NGPUS NODELIST
          free      4    160     0 noodle[220,221,222,223]
     allocated      0      0     0
          down      0      0     0

And you can echo ``$FLUX_URI`` to see the path of the socket that you will also need later:

.. code-block:: sh

    noodle:~$ echo $FLUX_URI
    local:///var/tmp/flux-MLmxy2/local-0

We have now defined a goal for success: getting this listing to work by running a command
from a node on a different cluster.

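As a purely illustrative aside (plain shell, not a Flux command), you can combine this node's
hostname with the socket path from ``$FLUX_URI`` to build the ``ssh://`` address we will use from
the other cluster:

.. code-block:: sh

    # strip the local:// scheme and prepend ssh://<hostname>
    noodle:~$ echo "ssh://$(hostname)${FLUX_URI#local://}"
    ssh://noodle220/var/tmp/flux-MLmxy2/local-0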

-----------------------
Connect to the Instance
-----------------------

Next, let's ssh into another cluster. Take the hostname where your instance is running,
and create a `proxy jump <https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts>`_ in your ``~/.ssh/config``:

.. code-block:: ssh

    Host noodle
        HostName noodle

    Host noodle220
        HostName noodle220
        ProxyJump noodle

.. note::

    This ``~/.ssh/config`` needs to be written on the cluster system where you are going to connect from.
    In many cases, the shared filesystem could map your home across clusters so you can see the file in
    multiple places.

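Before involving Flux at all, you may want to sanity check that the proxy jump itself works with a
plain ``ssh`` command (an optional step, assuming the config above lets you reach ``noodle220``):

.. code-block:: sh

    quartz:~$ ssh noodle220 hostname
    noodle220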


You'll first need to tell Flux to use ssh for the proxy command:

.. code-block:: sh

    quartz:~$ export FLUX_SSH=ssh

Next, from this same location, try using ``flux proxy`` to connect to your Flux instance! Target the URI
that you found before, ``local:///var/tmp/flux-MLmxy2/local-0``, swap the ``local://`` scheme for ``ssh://``, and add the hostname ``noodle220`` to the address:

.. code-block:: sh

    quartz:~$ flux proxy ssh://noodle220/var/tmp/flux-MLmxy2/local-0

If you have trouble - use the force!

.. code-block:: sh

    quartz:~$ flux proxy --force ssh://noodle220/var/tmp/flux-MLmxy2/local-0

You should then be able to run the same resource list:

.. code-block:: sh

    quartz:~$ flux resource list
         STATE NNODES NCORES NGPUS NODELIST
          free      4    160     0 noodle[220,221,222,223]
     allocated      0      0     0
          down      0      0     0

Next, try submitting a job! You should be able to see that the job runs on the first cluster,
even though you submitted it from the second.

.. code-block:: sh

    quartz:~$ flux mini run hostname
    noodle220

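If you'd like to convince yourself that you really control all four remote nodes, you can ask for one
task per node (an optional check, reusing the same ``-N4 -n4`` pattern as the ``srun`` example above):

.. code-block:: sh

    quartz:~$ flux mini run -N4 -n4 hostname
    noodle220
    noodle221
    noodle222
    noodle223
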
If you are still connected to the first cluster, you should also be able to query the jobs there.
E.g., here we submit a sleep job from the second cluster:

.. code-block:: sh

    quartz:~$ flux mini submit sleep 60
    f22hdyb35

And then see it from either cluster node!

.. code-block:: sh

    $ flux jobs | jq
    {
      "id": 2272725565440,
      "userid": 34633,
      "urgency": 16,
      "priority": 16,
      "t_submit": 1675713045.009863,
      "state": 16,
      "name": "sleep",
      "ntasks": 1,
      "nnodes": 1,
      "ranks": "2",
      "nodelist": "noodle220",
      "expiration": 1676317845,
      "t_depend": 1675713045.009863,
      "t_run": 1675713045.0290241,
      "annotations": {
        "sched": {
          "queue": "default"
        }
      }
    }

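As an optional convenience (just a sketch, not part of the tutorial files), you could wrap the proxy
invocation in a small shell function on quartz so you don't have to remember the full address; the
name ``noodle-proxy`` here is made up, and the URI changes whenever you start a new instance:

.. code-block:: sh

    # hypothetical helper for your shell rc on quartz
    noodle-proxy() {
        FLUX_SSH=ssh flux proxy ssh://noodle220/var/tmp/flux-MLmxy2/local-0 "$@"
    }

    # e.g., run a single command through the proxy
    quartz:~$ noodle-proxy flux resource list
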
And that's it! With this strategy, it should be easy to interact with a Flux instance across
two resources where ssh is supported. If you have any questions, please `let us know <https://github.com/flux-framework/flux-docs/issues>`_.
1 change: 1 addition & 0 deletions tutorials/index.rst
@@ -13,3 +13,4 @@ find a tutorial of interest.

lab/index
integrations/index
commands/index
7 changes: 5 additions & 2 deletions tutorials/lab/coral.rst
@@ -24,9 +24,12 @@ If you are using the ORNL system Summit, run:
    module use /sw/summit/modulefiles/ums/gen007flux/linux-rhel8-ppc64le/Core

.. _launch-flux-on-lassen:

--------------
Launching Flux
--------------

You can load the latest Flux-team managed installation on LLNL and ORNL CORAL
machines using:
