HyperQueue at Metacentrum (autofs, heterogeneous HPC system, /home is a symlink) together with our own on-prem HPC #733
---
Hi, this is indeed an interesting use case. As you have probably noticed, HyperQueue by default assumes a shared filesystem. It's not that it cannot work without one, but it makes certain decisions that are designed to make the life of users on clusters with a shared filesystem much easier (which conversely makes the life of users on clusters without a shared filesystem harder) :)

Regarding the home directory, I don't think there's much that HQ can do for you. If you are sharing the connection information through the filesystem automatically, you will need to remap the server directory paths somehow, as you have shown. Note that there is also support for running HQ on clusters without any filesystem sharing at all (https://it4innovations.github.io/hyperqueue/stable/deployment/cloud/), although I'm not sure if it will be that useful for your use case.

What is more annoying is the setting of task-level paths (working directory, stdout and stderr). In theory, it would be possible to just use relative paths for these (e.g. …). Currently, you're getting around this by using a directory that is accessible everywhere (…). I think that we could resolve this by adding a new placeholder (called e.g. …).

By the way, HQ was mostly designed to run inside a single cluster (with a shared filesystem), as you have probably already noticed. While using it across clusters does partially work (and I know that some people are using it in that way), I wonder if perhaps a different solution might be better, e.g. something like Lexis (https://portal.beta.lexis.tech/login, https://www.e-infra.cz/file/76b003888693abdd9a06846a20cd5d00/1222/2_LEXIS_eInfraCZ_2024.pdf) or HEAppE (https://heappe.eu).

In any case, let me know what you think, or if you have any other suggestions/requests.
---
### Goal of this discussion thread

#### Our use case for HQ
We aim to create a unifying layer on top of our shared compute resource provider (CESNET Metacentrum) and our on-prem HPC cluster.
Our cluster is firmly in our hands, and our users are provided with extensive scientific support. But we understand the benefits of letting our users expand to external providers like CESNET Metacentrum, IT4I, LUMI, etc., especially when the application is stable, the required tooling is already well understood by the user, and we only need… CPUs.
Without `hq`, users manually distribute their workloads between multiple HPC providers, writing both PBS and Slurm scripts… and I think that is the right moment to bring `hq` into the game.

#### Problem we observed
So let's submit one `hq` worker at Metacentrum and one `hq` worker at our on-prem system, and submit some dummy job to them (e.g. `hostname`, or `sleep 1`, etc.). So taking the quick start… and… nope:

#### My understanding of this problem
- `/home` is implemented as a symlink to different directories
- the `$HOME` env var of our Metacentrum worker is different than the `$HOME` of our on-prem worker

e.g. cluster skirit:

e.g. cluster tarkil:

…and we need to go deeper, because at least for `praha1` that's another symlink, but that one is already a real directory, nice :)

We can also see that the variable `${HOME}` differs across the clusters in the federation: if a job gets an allocation at `kirke4.meta.zcu.cz`, `$HOME` is `/storage/plzen1/home/jose`, which indeed contains different files than `/storage/praha1`.

That means we cannot reliably use `${HOME}` at all. So…
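The symlink chain described above can be inspected on each login node with standard coreutils before any `hq` is involved (a minimal sketch, nothing HQ-specific assumed):

```shell
# Where does this cluster think home is, and where does it really live?
echo "HOME=$HOME"
ls -ld /home            # shows the symlink target of /home, if it is one
readlink -f "$HOME"     # fully resolved, cluster-specific real path
```

Comparing the `readlink -f` output across skirit, tarkil, and the on-prem login nodes makes the mismatch immediately visible.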
#### Workaround / solution
In the `hq` worker launchers, export `REALHOME` pointing to the system-specific directory where our `app.sh` is located.

Metacentrum PBS worker batch script:
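(The original script is not shown above; the following is a sketch of what it could look like. The resource values, the `.hq-server` directory name, and the `hq` binary location are assumptions, not the author's actual script.)

```shell
#!/bin/bash
#PBS -l select=1:ncpus=8:mem=16gb
#PBS -l walltime=04:00:00

# Resolve the real, cluster-specific home so nothing depends on the
# /home symlink; e.g. /storage/praha1/home/<user> on praha1.
export REALHOME=$(readlink -f "$HOME")

# Start the worker against a server directory under the resolved path
# (assumes the HQ server connection info was copied/remapped there).
"$REALHOME"/bin/hq --server-dir="$REALHOME/.hq-server" \
    worker start --manager pbs
```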
Koios (one of our on-prem systems) Slurm worker batch script:
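(Again, the original script is missing here; this is a hedged sketch with assumed resource values and paths. On the on-prem system `/home` is a real directory, so `REALHOME` can simply be `$HOME`.)

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# On-prem: home is a real directory, no symlink remapping needed.
export REALHOME="$HOME"

"$REALHOME"/bin/hq --server-dir="$REALHOME/.hq-server" \
    worker start --manager slurm
```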
Submit the `app.sh` job:

Note that we use `aa.log` to stream the logs over `hq`, and set `cwd` to `/tmp`. Without setting `cwd` to `/tmp`, the job fails due to the `/home` and `$HOME` concept at Metacentrum.
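A sketch of what the submit step could look like, given the `aa.log`, `/tmp`, and `REALHOME` details from this thread. The flag names and the env-expansion trick are assumptions: older HQ releases stream output via `--log`, newer ones via `--stream`, and the `bash -c` wrapper relies on tasks inheriting the worker's environment so that `$REALHOME` resolves per worker, not on the submitting machine.

```shell
# Sketch only; check your HQ version for --log vs --stream.
hq submit --log=aa.log --cwd=/tmp \
    bash -c 'exec "$REALHOME"/app.sh'
```

Quoting `$REALHOME` inside single quotes is the key design choice: it defers expansion until the task actually runs on a worker, so each cluster substitutes its own system-specific path.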