add tutorial to connect to flux between clusters #192
Conversation
(force-pushed from f4f71f8 to 60422b6)
High level comment: this feels "less than" a tutorial but "more than" a FAQ entry. Would some type of "how to" section be wise, where we could put general help kinda things?
@chu11 I disagree with labeling this as "less than" a tutorial - it walks the reader through a complete process, and I would argue it is exactly the right size to grab and maintain attention. If it were a one-off command, then maybe it wouldn't be a tutorial. But it's an entire process with multiple steps, so it falls nicely here. We also should not have everything bunched under "FAQ" - it's already too busy. This in particular is specific to LC systems (we cannot guarantee it would work on others we have not tested) and I think it is placed exactly where I'd want to find it.
I do agree there should be something between FAQ and tutorial for general things, although I'm not sure what that looks like. Maybe a documentation page for each command and then detailed examples of how to do things? Or (if we aren't ready for that yet) a kind of cheat sheet? E.g., as a user what I find useful (when I'm looking to submit, for example) is a section like this https://rse-ops.github.io/knowledge/docs/schedulers/slurm.html#command-quick-reference subset to a specific command. I really just want to see what I'm looking for, copy, paste, edit, and go (without really digging through a tutorial or FAQ).
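For instance, such a per-command subset for Flux might look something like this (the app name and sizes here are just illustrative):

```sh
# A sketch of a per-command quick reference (hypothetical app and sizes):
$ flux mini submit -N2 -n4 ./my_app   # queue a job without blocking
$ flux jobs                           # list your jobs
$ flux mini run -N2 hostname          # run a command and wait for output
```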
Just a couple of quick comments inline
tutorials/lab/ssh.rst (outdated)

    First, let's create the allocation on the first cluster.
Starting a Flux instance on lassen is covered here. Generally, you'd want to launch the instance as a parallel job under the native resource manager, rather than get an allocation and use `flux start --test-size=N`. The test instance ignores the native resource allocation and just starts N brokers in place.
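For instance, a minimal contrast (the node count is arbitrary; the jsrun line is the lassen command quoted later in this thread):

```sh
# Test instance: ignores the native resource allocation and just starts
# N brokers in place on the current node:
$ flux start --test-size=4

# Proper launch: one broker per node, run as a parallel job under the
# native resource manager (LSF/jsrun on lassen):
$ jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 4 --bind=none --smpiargs="-disable_gpu_hooks" flux start
```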
Would it be possible to generalize it a bit? On clusters with native flux you might typically do ... on clusters with slurm ... on lassen see ?
@grondo this command?
$ jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start
This definitely calls for a "Rosetta stone" of starting flux commands - e.g., I could imagine the same but for slurm, or another job manager. I think there should be one place where someone can go and see them all side by side (although separate from this PR). What I can do is put this link / reference in the tutorial, and I'm still thinking about how we can structure a more general set of mini tutorials.
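As a rough sketch of what that side-by-side might look like (the Slurm and LSF lines are the ones quoted in this thread; the last line assumes a cluster where Flux is the native resource manager and `flux mini alloc` is available):

```sh
# Slurm:
$ salloc -N4 --exclusive
$ srun -N4 -n4 --pty --mpibind=off flux start

# LSF (lassen):
$ jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start

# Cluster already running Flux natively (nested instance):
$ flux mini alloc -N4
```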
@chu11 I agree we want this "how to start flux" across different places somewhere, maybe not here but definitely somewhere, along with other useful commands and different contexts / quick reference for running them.
I thought we already had "how to launch a sub-instance" in the FAQ, but I guess not ... seems like we should definitely add something.
This is from the lab's tutorial that @ryanday36 wrote.
>salloc -N4 --exclusive
salloc: Granted job allocation 321075
salloc: Waiting for resource configuration
salloc: Nodes opal[63-66] are ready for job
>srun -N4 -n4 --pty --mpibind=off flux start
>flux mini run -N4 hostname
opal63
opal64
opal65
opal66
Another way, from a manpage, is: `srun --pty -N8 flux start`
(i.e. no need for salloc, I think)
I'm actually struggling to find a way to do it A) without a pty and B) allow the user to easily get the FLUX_URI ...
Ah I see - so it's getting an allocation first still, but then running flux start with srun and then you can interact. Let me update the tutorial to use that, and if I can get access to a slurm cluster sometime with flux I can test again.
> I'm actually struggling to find a way to do it A) without a pty and B) allow the user to easily get the FLUX_URI ...

Ahh, `flux uri` has a slurm resolver based on searching pids.
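So something like this should work (the job ID and socket path are illustrative):

```sh
# Resolve the remote URI of a Flux instance started inside Slurm job 321075:
$ flux uri --remote slurm:321075
ssh://opal63/var/tmp/flux-XXXXXX/local-0
```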
I don't know jsrun usage either. You can also connect directly to a Flux instance running under Slurm with `flux proxy slurm:JOBID`. This is documented in Working with Flux job hierarchies.
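For example (job ID illustrative):

```sh
# Connect directly to the Flux instance running under Slurm job 321075;
# with no command given, flux proxy spawns a shell with FLUX_URI set:
$ flux proxy slurm:321075
$ flux mini run -N4 hostname
```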
tutorials/lab/ssh.rst (outdated)

    And make sure to get the hostname for your allocation:

    .. code-block:: sh

       $ hostname
       lassen220

    And you can echo ``$FLUX_URI`` to see the path of the socket that you will also need later:

    .. code-block:: sh

       $ echo $FLUX_URI
       local:///var/tmp/flux-MLmxy2/local-0
As a convenience you can also use `flux uri --remote lsf:3750480` or `--local` to get that info from a lassen login node if you know the LSF job ID running Flux. (That won't work if you use `flux start --test-size`, though.)
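e.g., something like (the socket path is illustrative):

```sh
# From a lassen login node, where 3750480 is the LSF job ID running Flux:
$ flux uri --remote lsf:3750480
ssh://lassen220/var/tmp/flux-MLmxy2/local-0
```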
To Al’s suggestion that it’s more than what belongs in a FAQ, maybe there could be levels of Tutorial where some are “How To” tutorials and others are lengthy multi-topic tutorials that may require a lot more time to view.
If there are lots of them there might be some good ways to group them.
Perhaps just a bullet under the tutorials section for "General How Tos" or something could suffice for now, and if it grows large enough we can split it out into another page. I now realize that you added this under the "Lab tutorials" section. In my mind it was more "general" at first, but perhaps it veers a bit lab-specific.
Okay, so here is an intermediate idea - what if we have "Command Tutorials", e.g., a section here: And then under there, we have each of the commands, and flux proxy would be one of them (and I'm happy to try to generalize the current doc here and move it away from being under lab tutorials, although I cannot guarantee it would be the same on other lab clusters). In these command tutorials we would generally try to show a lot of examples and contexts for doing something, and if there is a "source of truth" in one of the rc guides or similar, we'd link to it.

I think under Tutorials is the right place - I'm looking at FAQ and it's really busy - at best someone is going to find matching content there via a search. The other place I'd go is "Quickstart", but given that is supposed to be quick, I'm hesitant to add more content there (although I do think we can work on it more). So that leaves us with Guides and Tutorials, and a third option to add something that doesn't exist yet. Guides (imho) are complete guides for an entire user base, so I don't think this goes there. It could fall under a new category, "Command Guide" or similar, but I think it could also fit under Tutorials since we just have two types at the moment. What do others think? The format would be like:
Etc. We would want a user to come to that main page, find their use case, and then go to the respective guide. This might also be a good opportunity to better connect with those (currently separate) rc guides. We could also build our command line helper from this guide, because we'd have a set of example commands for each group.
I like this idea, we could have different "levels" of tutorials too. I'm imagining
Yes!! OK let me start us on this path - the pages will be a bit empty to start (one tutorial!) but we have to start somewhere!
(force-pushed from 60422b6 to b1de0aa)
Okay, the new "Commands Tutorials" are added! https://flux-framework--192.org.readthedocs.build/en/192/tutorials/commands/index.html But we should not merge yet because the jsrun command didn't work for me (see comment above) #192 (comment)
high level comment as I'm reading the tutorial, maybe put the hostname into the prompts, like:
or something. So there's a little bit that differentiates the output from when you're typing on lassen vs quartz (as some output like ...
(force-pushed from b1de0aa to 874b2dc)
suggested todos:
Going for a run, back in a bit.
(force-pushed from d47aad7 to 6a1d499)
Updated preview: https://flux-framework--192.org.readthedocs.build/en/192/tutorials/commands/ssh-across-clusters.html
(force-pushed from 6a1d499 to 23c5751)
    $ salloc -N4 --exclusive
    $ srun -N4 -n4 --pty --mpibind=off flux start
idea: to differentiate between the login node and the host you allocated/are running on, use `noodle:~$` vs `noodle220:~$` at the prompts here and below.
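Something like this, for instance (using the tutorial's placeholder hostnames):

```sh
# On the login node:
noodle:~$ salloc -N4 --exclusive

# After salloc drops you into a shell on the first allocated node:
noodle220:~$ srun -N4 -n4 --pty --mpibind=off flux start
```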
I thought we were on some general noodle node (that is part of the allocation?)
okay I added a comment at the top that they are on a login node, and then when they hit "noodle" they have their allocation.
My reasoning for the prompt change is b/c of your ssh config below. `noodle` appears to be the login node, while `noodle220` is the node that you were allocated to run your job. So it may not be clear which node you're actually on with just `noodle:~$`??
For example, w/ my personal prompt:
opal186|/g/g0/achu 44>salloc -N4 -ppbatch
salloc: Granted job allocation 321078
salloc: Waiting for resource configuration
salloc: Nodes opal[63-66] are ready for job
opal63|/g/g0/achu 21>srun hostname
opal63
opal66
opal65
opal64
You'll notice I was on opal186, the login node, and then salloc dropped me into a shell on opal63. So if I were to set up the ssh config, I would think I should set it up for opal186 and opal63.
I changed it so the login node is just empty (no name) and I explicitly state we are on the login node. Then I state we are on the allocation and just use `noodle:~$` to say we are on the allocation (and the specific node largely doesn't matter). I thought it looked nicer without the number so I left it out.
    }

    And that's it! With this strategy, it should be easy to interact with Flux instances from
    two resources where ssh is supported. If you have any questions, please `let us know <https://github.com/flux-framework/flux-docs/issues>`_.
> If you have any questions, please `let us know <https://github.com/flux-framework/flux-docs/issues>`_.

Instead of having something like this in every tutorial, perhaps we just need a header/footer comment or something within the "tutorials" page?
Or ... if people get to this page via search engines, perhaps it's good to have it at the bottom.
But then if they are in a specific tutorial, they wouldn't see it right? It's important (I think) to be at the bottom of the page so the reader might glance through the tutorial, and then feel like they are missing something / have a question and immediately see that.
(force-pushed from d2304a9 to a2674a8)
this uses a proxy jump in the ssh config. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
(force-pushed from a2674a8 to b37ae75)
LGTM! Thanks for getting this new batch of tutorials / docs started.
Not sure if I'm allowed to do this, but since @chu11 approved, I set that label.
Yes, once approved you can set MWP. Sometimes if the reviewer wants another person to skim before you set MWP we'll say "oh hey, such and such should take a look though b/c I'm not an expert on FOO".
Thank you for approving!
@chu11 I'm not great with using flux commands, but if you want to point me at a particular example from the workflows repo (the one we are going to archive), I can work on integrating that here next. It will be good for the docs and my learning. :)
In my opinion we may eventually want to add python equivalents to the CLI tutorials, like "oh hey, you wanna do this in python, do it this way" ... like there could be a python job-submit one after PR #194. Then we can eventually combine them all into an advanced one, like the python equivalent to my PR #195. The problem is the python scripts in flux-workflow-examples probably aren't up to date. It would not surprise me if some of them didn't work at all.
I'm actually more familiar with the python cli (given the small amount of work I've done with it), so that might be a good place for me to start. I'll see if I can add sphinx-gallery too so the tutorials render with the current version of flux on build. I'll need to add a .devcontainer environment for us to work from. Will start soon, latest next week!
Okay, step 1: #201
This uses a proxy jump in the ssh config.
I don't have this working yet, but wanted to open the PR to clearly lay out what I'm doing (so we can see what I'm doing wrong).
Update: all is working! I was missing `export FLUX_SSH=ssh` and then `--force`.
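For reference, a rough sketch of the pieces involved (hostnames, username, and socket path are illustrative):

```sh
# ~/.ssh/config on the other cluster: jump through the login node "noodle"
# to reach the allocated node "noodle220":
#
#   Host noodle220
#       HostName noodle220
#       User your-username
#       ProxyJump noodle

# Tell Flux to tunnel its connection over ssh, then attach to the remote instance:
$ export FLUX_SSH=ssh
$ flux proxy --force ssh://noodle220/var/tmp/flux-MLmxy2/local-0
```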
Signed-off-by: vsoch <vsoch@users.noreply.github.com>