Replies: 2 comments 1 reply
-
RStudio would be more convenient, and this is what my organization does. (And we are a highly regulated industry.) But either way should work as long as (1) the submission/termination commands are available on the cluster, and (2) the jobs you submit can connect back to the VM on the local network using the local IPv4 address of the VM.
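To sanity-check requirement (2) before involving any packages, here is a rough base-R sketch. The address and port are placeholders, not values from this thread:

```r
# On the VM: listen on an arbitrary open port (blocks until a client connects).
server <- socketConnection(port = 5700, server = TRUE, blocking = TRUE)
close(server)

# Inside a SLURM job: try to reach the VM's local IPv4 address.
client <- socketConnection(host = "YOUR_LOCAL_IPV4_ADDRESS", port = 5700,
                           blocking = TRUE, timeout = 5)
close(client)
```

If the second call errors out, the jobs cannot reach the VM, and the controller will never see its workers.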
All this looks right. I will just mention that my organization does not require containers, and we use environment modules for R (e.g. `module load` with the appropriate R module).
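As a sketch of how that can look with {crew.cluster}, using the `script_lines` argument (the module name below is a placeholder; check `module avail` on your cluster):

```r
# Hypothetical: load an R environment module in each worker's job script.
controller <- crew.cluster::crew_controller_slurm(
  script_lines = "module load R" # placeholder module name
)
```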
This may require trial and error, but hopefully not much. The default is
Yes, a minimal example is a great approach. I recommend working with {crew} directly:

```r
library(crew.cluster)
controller <- crew_controller_slurm(
  name = "slurm",
  workers = 5,
  slurm_memory_gigabytes_per_cpu = 4 # memory per CPU in gigabytes
)
controller$push(Sys.info()["nodename"])
controller$wait()
task <- controller$pop()
print(task)
print(task$result[[1L]])
```

Even before that, you can test the networking with just {mirai}. In the local R session:

```r
library(mirai)
url <- "ws://YOUR_LOCAL_IPV4_ADDRESS:5700" # getip::getip(type = "local")
daemons(n = 1L, url = url)
```

And in a SLURM job, run:

```r
mirai::daemon(url = "ws://YOUR_LOCAL_IPV4_ADDRESS:5700")
```

If networking is set up correctly, then `status()` in the local R session should show the daemon as online:

```r
# local R session
status() # from {mirai}
#> $connections
#> [1] 1
#>
#> $daemons
#>                                   i online instance assigned complete
#> ws://YOUR_LOCAL_IPV4_ADDRESS:5700 1      1        1        0        0
```
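When the test is done, the local session can disconnect (standard {mirai} usage):

```r
daemons(0L) # reset: shuts down the daemon and closes the connection
```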
Yes. It could be that the SLURM jobs are either not running or not connecting back over the network.
-
We use Nix in our HPC environments to handle the same environment issues, because Nix manages the full software stack all the way down. Nix package management pairs intuitively with targets + HPC, where we need precise control of the environment across a large infrastructure layout.

In base R tooling, some knowledge of the startup process can be helpful. Posit (a.k.a. RStudio) adds a layer of complexity because its products have different session-initialization processes than the base R startup. That can be a bit of a gotcha with targets (and any HPC tooling, for that matter), where you are commonly launching from an interactive Posit session onto a remote instance of CLI R. Here are some useful resources for understanding the various - and often tedious - nuances:
Do note that Posit's paid and open-source versions have very different startup processes, which can be tricky with HPC stuff.
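One way to see those differences concretely is to print the startup-relevant state in both the interactive Posit session and a CLI/worker R session and compare the output. This uses only base R:

```r
# Run in both the interactive session and a worker, then diff the results.
R.home()    # which R installation is active
.libPaths() # package library search path
Sys.getenv(c("R_PROFILE_USER", "R_ENVIRON_USER", "R_LIBS_USER"))
```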
-
I am trying to get started using my organization's SLURM cluster with targets, but it's not really something I'm familiar with. I could use some guidance on how to start the conversation with my system administrator. I'm confused about many things.
First of all, most work is done on an Ubuntu VM which can make SLURM job submissions to the HPC cluster. I also have the option of launching an instance of RStudio on a cluster node inside an Ubuntu container, but in that case the SLURM commands are not available, and my system administrator didn't think it would be right to add them. They thought it would be best to start up the cluster interface from the VM. Would that work? Would there be any limitations for long running workflows? I remember reading in the documentation that there was some expectation of starting things up directly from a node on the cluster. However, I don't actually know if that is essential or if all that is necessary is having access to the slurm cli tools.
Second, the HPC cluster itself is based on CentOS, so I imagine I would need to pass an Ubuntu container pretty similar to the VM in with the `script_lines` argument. I saw an example in discussion #35 where this was done, but it was a bit more complicated than I expected. It seems that four tasks have to be established.

What about the hostname argument? Do I need to set it?
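For reference, here is the sort of thing I mean. This is only a sketch, assuming a {crew.cluster} version whose controller accepts `host` directly; the IP is a placeholder:

```r
controller <- crew.cluster::crew_controller_slurm(
  host = "YOUR_LOCAL_IPV4_ADDRESS" # e.g. getip::getip(type = "local")
)
```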
Whatever is happening, I feel like I need to start with a minimal example and build from there. I had tried to make the targets-minimal project run on our cluster without anything special, but without success.
When I dropped the controller argument, the workflow ran, but I didn't see any new job submissions to SLURM according to `system("sacct -u $USER")`, so it seemed like everything ran locally. When I put it back in, the workflow ran pretty much forever, stuck on `raw_data_file`. Nothing ever finished. That's not really helpful for testing. What am I missing? Anyone with SLURM cluster experience have suggestions?

Thanks!
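For context, my test setup looks roughly like this (a sketch of targets-minimal plus a controller; the worker count is a placeholder):

```r
# _targets.R (sketch): targets-minimal plus a {crew.cluster} controller
library(targets)
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    name = "slurm",
    workers = 2 # placeholder
  )
)
# ...targets-minimal pipeline targets here...
```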