```bash
git clone <url>/hspn_surrogate_models
cd hspn_surrogate_models
uv sync  # or pip install -e .
```

Note: On some systems you may need to set higher timeouts and retries if you get sync errors, e.g., `UV_HTTP_TIMEOUT=120 UV_HTTP_RETRIES=6`.
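For example, the variables can be set inline for a single invocation (the values shown are the ones suggested in the note above):

```bash
# Retry the sync with a longer HTTP timeout and more retries
UV_HTTP_TIMEOUT=120 UV_HTTP_RETRIES=6 uv sync
```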
`hspn-prepare` is used to preprocess data and create an H5 dataset for use by the models.

```bash
hspn-prepare data_dir=./data branch_files=[f_total.npy] trunk_files=[xyz.npy] output_files=[y_total.npy] output_path=./data/don_dataset.h5
```

Note: There are more options; use `--cfg=job` to see them, and read the CLI documentation below to learn how to use this CLI.
This invocation corresponds to the following directory structure:

```
data/
| f_total.npy
| xyz.npy
| y_total.npy
| don_dataset.h5  # created
```
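A quick way to sanity-check the generated file is to list its contents. This sketch assumes the HDF5 command-line tools (`h5ls`) are available on your system; it does not depend on the file's internal layout.

```bash
# Recursively list the groups and datasets stored in the prepared H5 file
h5ls -r ./data/don_dataset.h5
```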
Note: `hspn-train` also has more options; use `--cfg=job` to see them, and read the CLI documentation below to learn how to use this CLI.
First, build the Apptainer image with `make hspn.sif`.

Next, parameterize a sweep by editing a configuration file (see `train_hpo*.yaml` for examples).

Finally, launch.

On PBS:

```bash
ACCT=XXXXXXXX cluster/hpo-pbs.sh
```

See the PBS launch script for documentation on configuration options.

On SLURM:

```bash
sbatch --account=XXXXXXXX cluster/hpo.slurm [<args>]
```

See the SLURM batch script for documentation on configuration options. Args can be passed to the train task as usual, e.g.,

```bash
sbatch --account=XXXXXXXX cluster/hpo.slurm comm_backend=gloo n_epochs=100
```

The following applies to all CLI applications in hspn.
To see all available options:
```bash
# hspn-<train/prepare/etc> stands in for any hspn CLI invocation
hspn-<train/prepare/etc> --help
hspn-<train/prepare/etc> --cfg=job  # or --cfg=all
```

It is recommended to check the final config the job will execute with before running:

```bash
hspn-<train/prepare/etc> --cfg=job            # or --cfg=all for verbose information
hspn-<train/prepare/etc> --cfg=job --resolve  # resolves variable references in the config (resolving is always done at runtime, so this shows the final resolved config the job will use)
```

For interactive experimentation, it is recommended to take advantage of shell completion, which can be installed with:

```bash
hspn-<train/prepare/etc> --shell-completion install=<bash/zsh/fish>
# or, a useful shorthand:
hspn-<train/prepare/etc> -sc install=$(basename $SHELL)
```

To install completion for train and prepare (these lines could be placed in ~/.zshrc, ~/.bashrc, etc.):

```bash
eval "$(hspn-train -sc install=$(basename $SHELL))"
eval "$(hspn-prepare -sc install=$(basename $SHELL))"
```

Now you can get autocomplete while setting configuration options. Try:

```bash
hspn-train model.<TAB><TAB>
```

Note: depending on your machine, completion may lag a bit.
Run a task that requires a dependency group:

```bash
uv run --extra gnn -m hspn.train_gnn
```

If you encounter an error such as:

```
ValueError: CategoricalDistribution does not support dynamic value space
```
This is likely because there is an Optuna DB persisted to disk (e.g., Redis) that already has a study with the same name you are using. You have changed the search space (rather than just resuming the study) and now there is a mismatch.
There are a few ways to address it:

- Use a different study name, either in the config file, at the CLI, or with the environment variable `OPTUNA_STUDY_NAME`, which gets passed through to the config, which contains something like `study_name: ${oc.env:STUDY_NAME}` (see the sketch after this list).
- Delete the old study.
- Delete the entire database if you just want to start over (it may be at `.redis/` depending on the configuration; check the launch script if not).
- Don't use the disk-persisted Optuna DB. This feature is optional and not vital for running Optuna sweeps. The example `train_hpo_optuna.yaml` sets `hydra.sweeper.storage=${oc.env:OPTUNA_STORAGE_URL,null}`, where `OPTUNA_STORAGE_URL` is set at launch. If this value is `null`, an in-memory store is used and nothing is persisted to disk.
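For example, the first two options might look like the following sketch. The study names are placeholders, and whether your launch path forwards `OPTUNA_STUDY_NAME` and accepts `OPTUNA_STORAGE_URL` directly are assumptions to adjust to your setup.

```bash
# Option 1: start a fresh study by exporting a new (placeholder) study name
# before launching; it is passed through to the config as described above.
export OPTUNA_STUDY_NAME=hspn_hpo_v2

# Option 2: delete the stale study from the persisted storage using the
# optuna CLI (bundled with the optuna package), depending on the backend.
optuna delete-study --study-name hspn_hpo_v1 --storage "$OPTUNA_STORAGE_URL"
```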
If you encounter a build error such as:

```
No space left on device
```

build in a sandbox with `--sandbox`, then convert the sandbox to an image with `apptainer build image.sif image.sif/`.
For example, instead of the standard build:

```bash
# Standard build:
apptainer build --fakeroot --bind "$(pwd):/workspace" hspn.sif cluster/hspn.def
```

use a sandbox:

```bash
# Sandbox build:
apptainer build --fakeroot --bind "$(pwd):/workspace" --sandbox hspn.sif/ cluster/hspn.def
apptainer build --fakeroot hspn.sif hspn.sif/
```

Apptainer/Singularity does not implement layer caching like Docker, so keeping a persistent sandbox may be of interest to reduce build time during development. For a persistent sandbox, simply name it something else:

```bash
# Persistent sandbox build:
apptainer build --fakeroot --bind "$(pwd):/workspace" --sandbox hspn.sandbox/ cluster/hspn.def
apptainer build --fakeroot hspn.sif hspn.sandbox/
```