Replies: 2 comments 1 reply
-
RStudio would be more convenient, and this is what my organization does. (And we are a highly regulated industry.) But either way should work as long as (1) the submission/termination commands are available on the cluster, and (2) the jobs you submit can connect back to the VM on the local network using the local IPv4 address of the VM.
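To sanity-check requirement (2) before involving any packages, here is a rough base-R sketch. The address and port are placeholders, not values from this thread:

```r
# On the VM: listen on an arbitrary open port (blocks until a client connects).
server <- socketConnection(port = 5700, server = TRUE, blocking = TRUE)
close(server)

# Inside a SLURM job: try to reach the VM's local IPv4 address.
client <- socketConnection(host = "YOUR_LOCAL_IPV4_ADDRESS", port = 5700,
                           blocking = TRUE, timeout = 5)
close(client)
```

If the second call errors out, the jobs cannot reach the VM, and the controller will never see its workers.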
All this looks right. I will just mention that my organization does not require containers, and we use environment modules for R (e.g. `module load` with the appropriate R module).
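As a sketch of how that can look with {crew.cluster}, using the `script_lines` argument (the module name below is a placeholder; check `module avail` on your cluster):

```r
# Hypothetical: load an R environment module in each worker's job script.
controller <- crew.cluster::crew_controller_slurm(
  script_lines = "module load R" # placeholder module name
)
```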
This may require trial and error, but hopefully not much. The default is
Yes, a minimal example is a great approach. I recommend working with {crew} directly:

```r
library(crew.cluster)
controller <- crew_controller_slurm(
  name = "slurm",
  workers = 5,
  slurm_memory_gigabytes_per_cpu = 4 # memory per CPU in gigabytes
)
controller$push(Sys.info()["nodename"])
controller$wait()
task <- controller$pop()
print(task)
print(task$result[[1L]])
```

Even before that, you can test the networking with just {mirai}. In the local R session:

```r
library(mirai)
url <- "ws://YOUR_LOCAL_IPV4_ADDRESS:5700" # getip::getip(type = "local")
daemons(n = 1L, url = url)
```

And in a SLURM job, run:

```r
mirai::daemon(url = "ws://YOUR_LOCAL_IPV4_ADDRESS:5700")
```

If networking is set up correctly, then `status()` in the local R session should show the daemon as online:

```r
# local R session
status() # from {mirai}
#> $connections
#> [1] 1
#>
#> $daemons
#>                                   i online instance assigned complete
#> ws://YOUR_LOCAL_IPV4_ADDRESS:5700 1      1        1        0        0
```
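When the test is done, the local session can disconnect (standard {mirai} usage):

```r
daemons(0L) # reset: shuts down the daemon and closes the connection
```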
Yes. It could be that the SLURM jobs are either not running or not connecting back over the network.
-
We use Nix in our HPC environments to handle the same environment issues, because Nix manages the full software stack all the way down. Nix package management pairs intuitively with targets + HPC, where we need precise control of the environment across a large infrastructure layout.

In base R tooling, some knowledge of the startup process can be helpful. Posit (a.k.a. RStudio) adds a layer of complexity because its products have different session-initialization processes than the base R startup. That can be a bit of a gotcha with targets (and any HPC tooling, for that matter), where you are commonly launching from an interactive Posit session onto a remote instance of CLI R. Here are some useful resources for understanding the various - and often tedious - nuances:
Do note that Posit's paid and open-source versions have very different startup processes, which can be tricky with HPC stuff.
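One way to see those differences concretely is to print the startup-relevant state in both the interactive Posit session and a CLI/worker R session and compare the output. This uses only base R:

```r
# Run in both the interactive session and a worker, then diff the results.
R.home()    # which R installation is active
.libPaths() # package library search path
Sys.getenv(c("R_PROFILE_USER", "R_ENVIRON_USER", "R_LIBS_USER"))
```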
-
I am trying to get started using my organization's SLURM cluster with targets, but it's not really something I'm familiar with. I could use some guidance on how to start the conversation with my system administrator. I'm confused about many things.
First of all, most work is done on an Ubuntu VM which can make SLURM job submissions to the HPC cluster. I also have the option of launching an instance of RStudio on a cluster node inside an Ubuntu container, but in that case the SLURM commands are not available, and my system administrator didn't think it would be right to add them. They thought it would be best to start up the cluster interface from the VM. Would that work? Would there be any limitations for long running workflows? I remember reading in the documentation that there was some expectation of starting things up directly from a node on the cluster. However, I don't actually know if that is essential or if all that is necessary is having access to the slurm cli tools.
Second, the HPC cluster itself is based on CentOS, so I imagine I would need to pass an Ubuntu container pretty similar to the VM in with the `script_lines` argument. I saw an example in discussion #35 where this was done, but it was a bit more complicated than I expected. It seems that four tasks have to be established.

What about the hostname argument? Do I need to set it?
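For reference, here is the sort of thing I mean. This is only a sketch, assuming a {crew.cluster} version whose controller accepts `host` directly; the IP is a placeholder:

```r
controller <- crew.cluster::crew_controller_slurm(
  host = "YOUR_LOCAL_IPV4_ADDRESS" # e.g. getip::getip(type = "local")
)
```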
Whatever is happening, I feel like I need to start with a minimal example and build from there. I had tried to make the targets-minimal project run on our cluster without anything special, but without success.
When I dropped the controller argument, the workflow ran, but I didn't see any new job submissions to SLURM according to `system("sacct -u $USER")`, so it seemed like everything ran locally. When I put it back in, the workflow ran pretty much forever, stuck on `raw_data_file`. Nothing ever finished. That's not really helpful for testing. What am I missing? Anyone with SLURM cluster experience have suggestions?

Thanks!
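For context, my test setup looks roughly like this (a sketch of targets-minimal plus a controller; the worker count is a placeholder):

```r
# _targets.R (sketch): targets-minimal plus a {crew.cluster} controller
library(targets)
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    name = "slurm",
    workers = 2 # placeholder
  )
)
# ...targets-minimal pipeline targets here...
```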