
add tutorial to connect to flux between clusters #192

Merged: 1 commit into master, Feb 10, 2023
Conversation

@vsoch vsoch commented Feb 6, 2023

This uses a proxy jump in the ssh config.

I don't have this working yet, but wanted to open the PR to clearly lay out what I'm doing (so we can see what I'm doing wrong).

Update: all is working! I was missing export FLUX_SSH=ssh and then --force

Signed-off-by: vsoch vsoch@users.noreply.github.com
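The setup described in the PR (a ProxyJump in the ssh config plus the `export FLUX_SSH=ssh` fix) could be sketched roughly like this. Hostnames and the socket path are hypothetical, chosen to match the tutorial's examples, and the config is written to a temp file for illustration rather than `~/.ssh/config`:

```shell
# Hypothetical ~/.ssh/config entries: "noodle" is the login node and
# "noodle220" an allocated compute node reached through it via ProxyJump.
cat > /tmp/example-ssh-config <<'EOF'
Host noodle
    HostName noodle.example.com

Host noodle220
    ProxyJump noodle
EOF

# flux proxy only tunnels over ssh when FLUX_SSH is set:
export FLUX_SSH=ssh

# Then, from the second cluster, you would connect with something like:
#   flux proxy ssh://noodle220/var/tmp/flux-MLmxy2/local-0
```

The `flux proxy` line is left as a comment because it requires a live Flux instance on the remote node.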


chu11 commented Feb 6, 2023

High-level comment: this feels "less than" a tutorial but "more than" a FAQ entry. Would some type of "how to" section be wise, where we could put general help kinds of things?

vsoch commented Feb 6, 2023

@chu11 I disagree with labeling this as "less than" a tutorial - it walks the reader through a complete process, and I would argue it is exactly the right size to grab and maintain attention. If it were a one-off command, then maybe it wouldn't be a tutorial. But it's an entire workflow with multiple components and steps, so it falls nicely here.

We also should not have everything bunched under "FAQ" - it's already too busy. This in particular is specific to LC systems (we cannot guarantee it would work on others we have not tested) and I think is placed exactly where I'd want to find it.

Where we could put general help kinda things?

I do agree there should be something between FAQ and tutorial for general things, although I'm not sure what that looks like. Maybe a documentation page for each command with detailed examples of how to do things? Or (if we aren't ready for that yet) a kind of cheat sheet? E.g., as a user, what I find useful (when I'm looking to submit, for example) is a section like this https://rse-ops.github.io/knowledge/docs/schedulers/slurm.html#command-quick-reference but scoped to a specific command. I really just want to see what I'm looking for, copy, paste, edit, and go (without really digging through a tutorial or FAQ).

@garlick garlick (Member) left a comment

Just a couple of quick comments inline

----------------------


First, let's create the allocation on the first cluster.
Member

Starting a Flux instance on lassen is covered here. Generally, you'd want to launch the instance as a parallel job under the native resource manager, rather than get an allocation and use flux-start --test-size=N. The test instance ignores the native resource allocation and just starts N brokers in place.

Member

Would it be possible to generalize it a bit? On clusters with native flux you might typically do ... on clusters with slurm ... on lassen see ?

Member Author

@grondo this command?

$ jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start

This definitely calls for a "Rosetta stone" of starting flux commands - e.g., I could imagine the same but for slurm, or another job manager. I think there should be one place where someone can go and see them all side by side (although separate from this PR). What I can do is put this link / reference in the tutorial, and I'm still thinking about how we can structure a more general set of mini tutorials.

Member Author

@chu11 I agree we want this "how to start flux" documented somewhere - maybe not here, but definitely somewhere - along with other useful commands and a quick reference for running them in different contexts.

Member

I thought we already had "how to launch a sub-instance" in the FAQ, but I guess not ... seems like we should definitely add something.

Member

This is from the lab's tutorial that @ryanday36 wrote.

>salloc -N4 --exclusive
salloc: Granted job allocation 321075
salloc: Waiting for resource configuration
salloc: Nodes opal[63-66] are ready for job
>srun -N4 -n4 --pty --mpibind=off flux start
>flux mini run -N4 hostname
opal63
opal64
opal65
opal66

Member

@chu11 chu11 Feb 9, 2023

Another way from a manpage is: srun --pty -N8 flux start (i.e. no need to salloc I think)

I'm actually struggling to find a way to do it A) without a pty and B) allow the user to easily get the FLUX_URI ...

Member Author

Ah I see - so it's getting an allocation first still, but then running flux start with srun and then you can interact. Let me update the tutorial to use that, and if I can get access to a slurm cluster sometime with flux I can test again.

Member

I'm actually struggling to find a way to do it A) without a pty and B) allow the user to easily get the FLUX_URI ...

Ahh, flux uri has a slurm resolver based on searching pids.

Contributor

I don't know jsrun usage either. You can also connect directly to a flux instance running under Slurm with flux proxy slurm:JOBID. This is documented in Working with Flux job hierarchies

Comment on lines 40 to 52
And make sure to get the hostname for your allocation:

.. code-block:: sh

   $ hostname
   lassen220

And you can echo ``$FLUX_URI`` to see the path of the socket that you will also need later:

.. code-block:: sh

   $ echo $FLUX_URI
   local:///var/tmp/flux-MLmxy2/local-0

Member

As a convenience you can also use flux uri --remote lsf:3750480 or --local to get that info from a lassen login node if you know the LSF job ID running flux. (That won't work if you use flux start --test-size though)
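As a side note on URI shapes: the local:// socket path quoted above maps one-to-one onto the ssh:// form that flux proxy can connect to from another machine. A small hypothetical helper (not part of flux-core; the function name is illustrative) makes the mapping explicit:

```python
from urllib.parse import urlparse

def to_ssh_uri(local_uri: str, hostname: str) -> str:
    """Rewrite a local:// FLUX_URI into the ssh:// form flux proxy accepts.

    Hypothetical helper for illustration; not part of the flux Python API.
    """
    parsed = urlparse(local_uri)
    if parsed.scheme != "local":
        raise ValueError(f"expected a local:// URI, got {local_uri!r}")
    # ssh://<host><socket-path>, usable as: flux proxy ssh://<host>/<path>
    return f"ssh://{hostname}{parsed.path}"

print(to_ssh_uri("local:///var/tmp/flux-MLmxy2/local-0", "lassen220"))
# -> ssh://lassen220/var/tmp/flux-MLmxy2/local-0
```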

springme commented Feb 6, 2023 via email

chu11 commented Feb 6, 2023

I do agree there should be something between FAQ and tutorial, and for general things, although I'm not sure what that looks like.

Perhaps just a bullet under the tutorials section for "General How Tos" or something could suffice for now. And if it grows large enough we split out into another page.

I now realize that you added this under the "Lab tutorials" section. In my mind it was more "general" at first, but perhaps it swerves a bit lab-specific.

vsoch commented Feb 6, 2023

okay so here is an intermediate idea - what if we have "Command Tutorials" e.g., a section here:

(screenshot of the proposed docs section)

And then under there, we have each of the commands, and flux proxy would be one of them (and I'm happy to try and generalize the current doc here and move away from being under lab tutorials, although I cannot guarantee it would be the same on other lab clusters). In these Command tutorials we would generally try to show a lot of examples and contexts for doing something, and if there is a "source of truth" in one of the rc guides or similar, we'd link to it.

I think under Tutorials is the right place - I'm looking at FAQ and it's really busy - at best someone is going to find matching content here via a search. The other place I go is "Quickstart" but given that is supposed to be quick, I'm hesitant to add more content there (although I do think we can work on it more). So that leaves us with Guides and Tutorials, and a third option to add something that doesn't exist yet. Guides (imho) are complete guides for an entire user base, so I don't think it goes there. It could fall under a new category "Command Guide?" or similar, but I think that could also fit under tutorials since we just have two types at the moment. What do others think?

The format would be like:

# This is the main Command Tutorials page

- `flux proxy`: "I want to connect to a flux instance across clusters with ssh"
- `flux start`: "I want to start my own flux allocation/instance to launch jobs"
- `flux mini submit/run`: "I want to run or submit a job to a flux instance"

Etc. We would want a user to come to that main page, find their use case, and then go to the respective guide. This might also be a good opportunity to better connect with those (currently separate) rc guides. We could also build our command line helper from this guide, because we'd have a set of example commands for each group.

chu11 commented Feb 6, 2023

# This is the main Command Tutorials page
- `flux proxy`: "I want to connect to a flux instance across clusters with ssh"
- `flux start`: "I want to start my own flux allocation/instance to launch jobs"
- `flux mini submit/run`: "I want to run or submit a job to a flux instance"

I like this idea; we could have different "levels" of tutorials too. I'm imagining:

- `flux proxy`: "send commands to a flux instance you've started"
- `flux proxy`: "send commands to a flux instance across clusters using ssh"

vsoch commented Feb 6, 2023

I like this idea, we could have different "levels" of tutorials too. I'm imagining

Yes!! OK let me start us on this path - the pages will be a bit empty to start (one tutorial!) but we have to start somewhere!

vsoch commented Feb 6, 2023

Okay, the new "Command Tutorials" are added! https://flux-framework--192.org.readthedocs.build/en/192/tutorials/commands/index.html But we should not merge yet, because the jsrun command didn't work for me (see comment above) #192 (comment)

chu11 commented Feb 6, 2023

High-level comment as I'm reading the tutorial: maybe put the hostname into the prompts, like:

lassen:~$

or something. So there's a little bit that differentiates the output from when you're typing on lassen vs quartz (as some output like flux resource list is identical).

vsoch commented Feb 6, 2023

suggested todos:

  • for this PR: get the working command for flux start on lassen (so I can test)
  • I can start an issue that lists the guides / "command tutorials" we want to write (and post ideas here) Command Tutorials to do #193

Going for a run, back in a bit.

@vsoch vsoch mentioned this pull request Feb 7, 2023
16 tasks
@vsoch vsoch force-pushed the add/ssh-tutorial branch 2 times, most recently from d47aad7 to 6a1d499 Compare February 9, 2023 18:58
vsoch commented Feb 9, 2023

Updated preview: https://flux-framework--192.org.readthedocs.build/en/192/tutorials/commands/ssh-across-clusters.html

  • I added the srun example to start the instance
  • I generalized the name of the first cluster so it's not heavy with our cluster names!

Comment on lines +21 to +22
$ salloc -N4 --exclusive
$ srun -N4 -n4 --pty --mpibind=off flux start
Member

Idea: to differentiate between the login node and the host you were allocated / are running on, use noodle:~$ vs noodle220:~$ at the prompts here and below.

Member Author

I thought we were on some general noodle node (that is part of the allocation?)

Member Author

okay I added a comment at the top that they are on a login node, and then when they hit "noodle" they have their allocation.

Member

My reasoning for the prompt change is because of your ssh config below: noodle appears to be the login node, while noodle220 is the node you were allocated to run your job. So it may not be clear which node you're actually on with just noodle:~$.

For example, w/ my personal prompt:

opal186|/g/g0/achu 44>salloc -N4 -ppbatch
salloc: Granted job allocation 321078
salloc: Waiting for resource configuration
salloc: Nodes opal[63-66] are ready for job

opal63|/g/g0/achu 21>srun hostname
opal63
opal66
opal65
opal64

You'll notice I was on opal186, the login node, and then salloc dropped me into a shell on opal63. So if I were to set up the ssh config, I would think I should set it up for opal186 and opal63.

Member Author

I changed it so the login node is just empty (no name) and I explicitly state we are on the login node. Then I state we are on the allocation and just use noodle:~$ to say we are on the allocation (and the specific node largely doesn't matter). I thought it looked nicer without the number so I left it out.

tutorials/commands/ssh-across-clusters.rst - outdated review thread, resolved
}

And that's it! With this strategy, it should be easy to interact with Flux instances from
two resources where ssh is supported. If you have any questions, please `let us know <https://github.com/flux-framework/flux-docs/issues>`_.
Member

If you have any questions, please let us know <https://github.com/flux-framework/flux-docs/issues>_.

Instead of having something like this in every tutorial, perhaps we just need a header/footer comment or something within the "tutorials" page?

Member

or ... if people get to this page via search engines, perhaps it's good to have it at the bottom.

Member Author

But then if they are in a specific tutorial, they wouldn't see it, right? It's important (I think) for it to be at the bottom of the page, so the reader might glance through the tutorial, feel like they are missing something or have a question, and immediately see it.

@vsoch vsoch force-pushed the add/ssh-tutorial branch 2 times, most recently from d2304a9 to a2674a8 Compare February 9, 2023 21:37
this uses a proxy jump in the ssh config.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Member

@chu11 chu11 left a comment

LGTM! Thanks for getting this new batch of tutorials / docs started.

vsoch commented Feb 9, 2023

@grondo I think we can merge with your blessing? And then unblock @cmoussa1.

@vsoch vsoch added the merge-when-passing mark PR for auto-merging by mergify.io bot label Feb 10, 2023
vsoch commented Feb 10, 2023

Not sure if I'm allowed to do this, but since @chu11 approved, I set that label.

@mergify mergify bot merged commit dc61a2f into master Feb 10, 2023
@vsoch vsoch deleted the add/ssh-tutorial branch February 10, 2023 02:32
chu11 commented Feb 10, 2023

Yes, once approved you can set MWP. Sometimes, if the reviewer wants someone else to skim before you set MWP, we'll say "oh hey, such-and-such should take a look because I'm not an expert on FOO".

vsoch commented Feb 10, 2023

Thank you for approving!

vsoch commented Feb 10, 2023

@chu11 I'm not great with using flux commands, but if you want to point me at a particular example from the workflows repo (the one we are going to archive), I can work on integrating that here next. It will be good for the docs and my learning. :)

chu11 commented Feb 10, 2023

In my opinion we may eventually want to add python equivalents to the CLI tutorials, like "oh hey, you wanna do this in python, do it this way" ... e.g., there could be a python job-submit one after PR #194. Then we can eventually combine them all into an advanced one, like the python equivalent of my PR #195.

Problem is the python scripts in flux-workflow-examples probably aren't up to date. It would not surprise me if some of them didn't work at all.

vsoch commented Feb 10, 2023

I'm actually more familiar with the python cli (given the small amount of work I've done with it) so that might be a good place for me to start. I'll see if I can add sphinx-gallery too so the tutorials render with the current version of flux on build. I'll need to add a .devcontainer environment for us to work from.

Will start soon, next week at the latest!

vsoch commented Feb 10, 2023

Okay, step 1: #201
