Hyperqueue at Metacentrum (autofs, heterogenous HPC system, /home is symlink ) together with own on-prem HPC #733

jose-d · 2024-07-23T11:49:49Z

jose-d
Jul 23, 2024

Goal of this disscusion thread:

feed search engine with working example
collect feedback and experience from other users of CESNET Metacentrum
show how HQ is helping to do science at Institute of Physics, ASCR :)

Our usecase for HQ

We aim to create unifying layer on top of our shared compute resource provider, Metacentrum/CESNET and our on-prem HPC cluster.

Our cluster is firmly in our hands, and our users are provided with extensive scientific support. But we understand the benefits of letting our users expand to external providers, like Cesnet Metacentrum, IT4i, Lumi, etc, especially in cases when the application is stable and required tooling is already well understood by user, and we need only .. CPUs.

Without hq, users manually distribute their workloads between multiple HPC providers, writing both PBS and Slurm scripts .. and I think that is the right moment to bring hq into game:

Problem we observed

So lets submit one hq worker at metacentrum, and one hq worker at our on-prem system, and submit some dummy job to them (eg. hostname, or sleep 1, etc. ). So taking the quick start .... and .. nope:

(BOOKWORM)jose@tarkil:~$ ./tools/hq job info 1
+----------------------+--------------------------------------------------------+
| ID                   | 1                                                      |
| Name                 | sleep                                                  |
| State                | FAILED                                                 |
| Tasks                | 1; Ids: 0                                              |
| Workers              |                                                        |
| Resources            | cpus: 1 compact                                        |
| Priority             | 0                                                      |
| Command              | sleep                                                  |
|                      | 1                                                      |
| Stdout               | /auto/vestec1-elixir/home/jose/job-1/%{TASK_ID}.stdout |
| Stderr               | /auto/vestec1-elixir/home/jose/job-1/%{TASK_ID}.stderr |
| Environment          |                                                        |
| Working directory    | /auto/vestec1-elixir/home/jose                         |
| Task time limit      | None                                                   |
| Crash limit          | 5                                                      |
| Submission date      | 2024-07-23 09:36:54 UTC                                |
| Submission directory | /auto/vestec1-elixir/home/jose                         |
| Makespan             | 2ms                                                    |
+----------------------+--------------------------------------------------------+
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Task ID | Worker | Error                                                                                                                                                                                          |
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-1/0.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 tasks failed.
(BOOKWORM)jose@tarkil:~$

My understanding of this problem

Metacentrum is federation of multiple clusters. When job is submitted to the scheduler (PBS), by default, job can be executed at any cluster
Different clusters have /home implemented as symlink to different directories
also $HOME env var of our metacentrum worker is different than $HOME of our on-prem worker

eg. cluster skirit:

$ ll /home
lrwxrwxrwx 1 root root 20 čec 21  2020 /home -> /storage/praha1/home/

eg. cluster tarkil:

$ ls -lah /home
lrwxrwxrwx 1 root root 19 15. zář  2020 /home -> /storage/brno2/home

and we need to go deeper, because at least for praha1, that's another symlink, but that's already real directory, nice :)

$ ll /storage/praha1/home
lrwxrwxrwx 1 root root 25 úno  3  2021 /storage/praha1/home -> /auto/vestec1-elixir/home/

We can also see, that variable ${HOME} differs across multiple clusters in federation, so if job gets alocation at kirke4.meta.zcu.cz, $HOME is /storage/plzen1/home/jose where are indeed different files than in /storage/praha1.

That means, that we can not reliably use ${HOME} at all. So...

Workaround / solution:

in hq worker launchers, export REALHOME pointing to system-specific directory where our app.sh is located:

Metacentrum PBS worker batch script:

#!/bin/bash
#PBS -N testing_HQ
#PBS -l select=1:ncpus=4:mem=1gb
#PBS -l walltime=0:30:00

hq_dir='/storage/praha1/home/jose/tools/'
credsdir="/storage/praha1/home/jose/projects/2024_07_22__distributed_hyperqueue/wd"

export REALHOME=/storage/praha1/home/jose

${hq_dir}/hq worker --server-dir=${credsdir} start --manager pbs

Koios (one of our on-prem systems) Slurm batch script:

#!/bin/bash
  
#SBATCH --job-name=hq_worker
#SBATCH --time=00:33:33
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1

hq_dir='/home/jose/tools/hq'
credsdir="/home/jose/projects/2024_07_18__hyperqueue"

export REALHOME=/home/jose

${hq_dir}/hq worker --server-dir=${credsdir} start --manager slurm

submit app.sh job:

./tools/hq submit --array=1-30 --log=aa.log --cwd=/tmp bash -c '$REALHOME/app.sh'

Note that we use aa.log to stream the logs over hq and set cwd to /tmp.

Without setting cwd to /tmp the job fails due to /home and $HOME concept at Metacentrum.

Kobzol · 2024-07-23T12:12:20Z

Kobzol
Jul 23, 2024
Maintainer

Hi, this is indeed an interesting use-case. As you have probably noticed, HyperQueue by default assumes a shared filesystem. It's not that it cannot work without it, but it makes certain decisions that are designed to make the life of users on clusters with a shared filesystem much easier (but this also conversely makes the life of users on clusters without a shared filesystem harder) :)

Regarding the home directory, I don't think there's that much that HQ can do for you. If you are sharing the connection information through the filesystem automatically, you will need to remap the server directory paths somehow, as you have shown. Note that there is also support for running HQ on clusters without any filesystem sharing at all (https://it4innovations.github.io/hyperqueue/stable/deployment/cloud/), although I'm not sure if it will be that useful for your use-case.

What is more annoying is the setting of task-level paths (working directory, stdout and stderr). In theory, it would be possible to just use relative paths for these (e.g. --cwd=foo), and these could then be resolved relatively to the working directory of the worker, which would mostly resolve the issues caused by the filesystem not being shared. However, HQ currently expands these paths eagerly relative to the submission directory, so if you specify --cwd=foo, it will not be set to <worker-cwd>/foo, but to <submit-dir>/foo and that won't help your use-case. It was designed like this explicitly, it's usually much nicer if the data appears in the directory from which you submit, rather than in some random directory where the worker is started (often it is started automatically by the automatic allocator, who knows where :) ). However, for your use-case it's a bit unfortunate.

Currently, you're getting around this by using a directory that is accessible everywhere (/tmp) as a working directory, and by using I/O streaming for avoiding having to deal with stdout and stderr, but that is of course not a great solution.

I think that we could resolve this by adding a new placeholder called e.g. %{WORKER_CWD}, and then you could set --cwd=%{WORKER_CWD}/foo, which would mostly resolve issues related to paths on workers. Let me know what do you think about this, would it help your use-case? (I think that in any case, this would be a useful addition, @spirali what do you think?).

By the way, HQ was mostly designed to run inside a single cluster (with a shared filesystem), as you have probably already noticed. While using it across clusters does partially work (and I know that some people are using it in that way), I wonder if perhaps a different solution for that might be better, e.g. something like Lexis (https://portal.beta.lexis.tech/login, https://www.e-infra.cz/file/76b003888693abdd9a06846a20cd5d00/1222/2_LEXIS_eInfraCZ_2024.pdf) or HEAppE (https://heappe.eu).

In any case, let me know what do you think, or if you have any other suggestions/requests.

3 replies

jose-d Jul 23, 2024
Author

there is also support for running HQ on clusters without any filesystem sharing at all (https://it4innovations.github.io/hyperqueue/stable/deployment/cloud/), although I'm not sure if it will be that useful for your use-case.

without this feature, our scenario would be impossible! It is very helpful, that we can take access.json literally anywhere and connect to any hq server.

so if you specify --cwd=foo, it will not be set to <worker-cwd>/foo, but to <submit-dir>/foo

exactly! and submit-dir is the /{metacentrum-specific}/home/{username}, it is non-existent at our facilities, where we have real /home. We even have users with different usernames at different systems, so then the paths are broken in even more obvious way.

That IMO affects dealing with the stdout and stderr logfiles too - because logfiles directory is derived from <submit-dir> too, right?

I think that we could resolve this by adding a new placeholder called e.g. %{WORKER_CWD}, and then you could set --cwd=%{WORKER_CWD}/foo

yes. This would fix the path to user app in home, not sure if that would fix the stdout and stderr dirs too?

Kobzol Jul 23, 2024
Maintainer

without this feature, our scenario would be impossible! It is very helpful, that we can take access.json literally anywhere and connect to any hq server.

I see! Good to know :)

That IMO affects dealing with the stdout and stderr logfiles too - because logfiles directory is derived from too, right?

If you don't use output streaming (so in the default case), then yes, this also affects the location of the stdout/stderr files in the same way. With output streaming, the server writes stdout/stderr into a single file (usually stored on the host node), so there it should be fine.

Note that stdout/stderr paths are already relative to the working directory by default, so just changing --cwd to e.g. %{WORKER_DIR}/%{TASK_ID} should also fix stdout/stderr file paths.

yes. This would fix the path to user app in home, not sure if that would fix the stdout and stderr dirs too?

Hmm, actually it should be the opposite 😆 With %{WORKER_CMD}, you could set the cwd/stdout/stderr paths of a job to be relative to the worker directory, thus resolving the issues with different filesystems on different workers.

This wouldn't help the configuration of the server directory though, because that does not use any placeholders like %{WORKER_CMD}. Users usually have control over starting the worker, so they can configure the right server directory at that time. This is different from submitting jobs, where you don't know beforehand on which worker will a givne task execute.

By the way, if you have a specific workflow/use-case for which you're using sucessfully HyperQueue, and you'd be OK with sharing it with me, I would love to hear about it! You can contact me at berykubik@gmail.com if you'd like to share some more details. I'm currently trying to gather some real-world usages of HQ so that I can describe them in my PhD thesis :)

jose-d Jul 23, 2024
Author

I'm currently trying to gather some real-world usages of HQ so that I can describe them in my PhD thesis :)

sure, my colleague I am supporting ( I'm HPC guy in our team ) does research in area of gravitational waves. The problem he is exploring can be decomposed into many independent sub-problems.

I asked him to take look at this thread and possibly write few sentences about the scientific essence of this code. From ethical reasons I can not speak about scientific concepts myself, especially before publication is out.

But I can tell what we expect we'll gain when using hyperqueue / why we decided to use it:

unified access to multiple clusters. We have two on-prem clusters + Metacentrum. We're thinking about IT4i. Splitting problem between multiple systems manually is ineffective, indeed in various times the systems have various amount of available nodes, jobs are subject of QoS rules, so by using hq we get something like loadbalancing..
resiliency - sometimes job array element hangs and is killed. Hq can reschedule it.
increase the amount of job array elements. Our Slurm can deal with quite large (N00 000) job arrays, PBS@Metacentrum support only small ones.

Well, and all that is possibly answer why we went with hqand not with Lexis or HEAppE.. hq seems to be quite transparent, the concepts are straightforward. After few initial sessions and P-O-C I am sure my colleague will be able to use it himself, tweak the jobscripts, and use it for his future research projects. With Lexis or HEAppE.. it looks we'd need to dedicate some FTE just for that! :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hyperqueue at Metacentrum (autofs, heterogenous HPC system, /home is symlink ) together with own on-prem HPC #733

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Hyperqueue at Metacentrum (autofs, heterogenous HPC system, /home is symlink ) together with own on-prem HPC #733

jose-d Jul 23, 2024

Replies: 1 comment · 3 replies

Kobzol Jul 23, 2024 Maintainer

jose-d Jul 23, 2024 Author

Kobzol Jul 23, 2024 Maintainer

jose-d Jul 23, 2024 Author

jose-d
Jul 23, 2024

Replies: 1 comment 3 replies

Kobzol
Jul 23, 2024
Maintainer

jose-d Jul 23, 2024
Author

Kobzol Jul 23, 2024
Maintainer

jose-d Jul 23, 2024
Author