Utilities to set up and launch Amber MD “swarms” (many independent trajectories) on Slurm-based clusters.
Two launch modes are supported:
- `array` — one job array with N array tasks (one trajectory per task).
- `alloc` — allocate nodes and launch N per-trajectory tasks yourself (one `srun` per trajectory).
The scripts auto-detect the cluster via Slurm’s ClusterName and load the right Amber stack. Supported clusters are scu, cayuga, delta, and frontier. If the cluster is not one of these, the run will abort (no silent fallback).
You run `./launch_swarm.sh` (not `sbatch`). It submits the real jobs and wires up dependencies for chained subjobs per trajectory.
- Repository layout
- Prerequisites
- Cluster autodetect & Amber backends
- Quick start
- Configuration — `launch_swarm.sh`
- Run modes
- How chaining & restarts work
- Directory conventions & inputs
- Logs & outputs
- Safe cleanup (trash) behavior
- Setting wall time
- Troubleshooting
- FAQ
```
Amber-Ensemble/
├─ launch_swarm.sh            # submit orchestrator (array or alloc)
├─ submit_swarm_subjobs.sh    # per-trajectory runner (does safe cleanup & srun)
├─ setup_individual_swarm.sh  # helper to make per-traj directory scaffolding
├─ run_amber.sh               # cluster-aware wrapper that runs pmemd.{cuda,hip}
├─ inputs/                    # your input bundles (topology, mdin, rst7, etc.)
└─ raw_swarms/                # generated swarms + logs
   ├─ submission_logs/        # Slurm stdout for submission steps
   └─ swarm0000/              # one directory per swarm
      ├─ swarm0000_traj0000/  # one directory per trajectory
      ├─ swarm0000_traj0001/
      └─ ...
```
- A Slurm cluster with GPUs and a working Amber install that provides one of: `pmemd.cuda` (NVIDIA backends) or `pmemd.hip` (AMD ROCm backends).
- Your account / allocation and partition names.
- The `run_amber.sh` script has module and environment setup for the supported clusters (see below). If your site differs, edit that script's case block.

No `~/.bashrc` edits are required. All environment is set inside the job via `run_amber.sh`.
`run_amber.sh` calls `scontrol show config | awk -F= '/ClusterName/ {print $2}'` to detect the cluster and then loads site-specific modules / Amber env:
- `frontier`

```bash
module load ums/default
module load ums002/default
module load openmpi/4.1.5
module load rocm/5.3.0
source /ccs/proj/bip109/frontier/amber22_gcc11.2/amber22_src/dist/amber.sh
# Uses: pmemd.hip
```

- `scu`

```bash
source /software/apps/amber/amber24/pmemd24/amber.sh
module load cuda/12.2.1-gcc-8.2.0-23runmx
module load openmpi/4.1.5
# Uses: pmemd.cuda
```

- `cayuga`

```bash
module load amber/24
# Uses: pmemd.cuda
```

- `delta`

(Set according to your A100/MI100 partition; the default is the CUDA path.)

```bash
# Typical CUDA stack on A100 partitions
module load cuda
module load openmpi/4.1.x
# Uses: pmemd.cuda
```
If the cluster name is not one of the above, the script exits with an error. Add a new case to `run_amber.sh` if you need another site.
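If you do add a site, the new case only needs to load that site's modules and select the right engine. A minimal sketch of the shape, assuming a hypothetical cluster name and placeholder modules (this is not the actual contents of `run_amber.sh`):

```bash
# Hypothetical new entry for the case block ("mycluster", the modules, and the
# variable name "engine" are placeholders).
cluster=$(scontrol show config | awk -F= '/ClusterName/ {print $2}' | tr -d '[:space:]')

case "$cluster" in
  mycluster)
    module load cuda            # site-specific modules go here
    module load openmpi
    engine="pmemd.cuda"         # pmemd.hip on ROCm machines
    ;;
  *)
    echo "ERROR: unsupported cluster '$cluster'" >&2
    exit 1
    ;;
esac
```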
- Download code

Clone a copy of this repository into a directory that is appropriate for running swarms of MD simulations:

```bash
cd wherever_you_wish_to_run
git clone git@github.com:weinsteinlab/Amber-Ensemble.git
```

Optionally, you can rename the cloned repository to something meaningful for your calculations:

```bash
mv Amber-Ensemble my_Amber_ensemble
```

Once this is done, go into this directory:

```bash
cd my_Amber_ensemble   # or whatever the directory is named at this point
```

Note: this directory is hereafter referred to as the parent directory.
- Stage inputs

Note: You'll see there is another directory called `./common`: do NOT edit anything in this directory! It contains a script that facilitates job management and should not be edited by the user.

`./inputs` should contain a separate subdirectory for each unique system you wish to run. These subdirectories MUST have 4-zero-padded, zero-indexed names (e.g., `0000`, `0001`, `0002`, etc.). Deviating from this nomenclature WILL break the scripts.
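For example, correctly named subdirectories for four systems can be created like this (illustrative; adjust the count to your number of systems):

```bash
# Creates inputs/0000 through inputs/0003 (4-zero-padded, zero-indexed)
for i in $(seq 0 3); do
  mkdir -p "inputs/$(printf '%04d' "$i")"
done
```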
Note: This workflow assumes you are RESTARTING Amber simulations (i.e., you have already built/equilibrated the system with CHARMM-GUI).
Each subdirectory must contain all of the system-specific files needed to simulate that system with Amber 22:
- `parm.parm7`: this topology/parameter file contains the model's structural information (bond connectivity, etc.). The file can be named anything, but must end in `.parm7`, and cannot be a binary file. Note: this file also contains toppar information, so extra toppar files are not needed.
- `swarm0000_subjob0000.rst7`: restart coordinates for your system. Use this name.
- `input.mdin`: this input script defines how the Amber22 simulation is run; it MUST be named `input.mdin`. Here you select the statistical ensemble (e.g., NPT), the temperature, and many other simulation parameters. The settings are well commented, so they should be clear. Be sure to consider the temperature (`temp0`), the coordinate output interval (`ntwx`), and other physics-related settings.
- `numberOfReplicas.txt`: this file contains the number of replicas to run for this system. MUST be a multiple of 8.
- Optional: you can also include `swarm0000_subjob0000.mdinfo`, `swarm0000_subjob0000.mdout`, and `swarm0000_subjob0000.nc`, but this is just for convenience (these specific files are not used in the running of this ensemble job).
Note: make sure you have benchmarked each different system and have adjusted its individual `nstlim=` parameter accordingly. This workflow supports running an arbitrarily high number of systems (up to 9,999) with no restrictions on size differences. However, this functionality relies on adjusting each system's `nstlim=` to what can run within the job's time limit.
Note: I recommend requesting only 80% of the number of steps that can be performed within the job time limit. This way, there is little risk of any system running out of time and creating a mess to clean up.
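As a back-of-the-envelope sketch of that sizing (the benchmark rate and wall clock below are made-up numbers; substitute your own):

```bash
# Pick nstlim from a benchmarked rate and the job's wall-clock limit.
steps_per_hour=50000    # measured in a short benchmark of this system (illustrative)
wallclock_hours=4       # should match time_limit in launch_swarm.sh
pct=80                  # request ~80% of what fits, per the note above

nstlim=$(( steps_per_hour * wallclock_hours * pct / 100 ))
echo "nstlim = ${nstlim}"   # -> 160000; set this in the system's input.mdin
```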
VERY IMPORTANT: `input.mdin` contains only ensemble-related information. The input files themselves are detected automatically from what is present in each subdirectory. Do NOT describe input files in this file, or the scripts will break.
Finally, if you only have 1 system to run (with many replicas), just create 1 subdirectory in inputs.
- Build the swarm directory

Use `setup_individual_swarm.sh` to scaffold `raw_swarms/swarm0000/` and its `traj####` subdirs from `inputs/`. (Edit its top few variables if needed.)

- Edit `launch_swarm.sh` (user config at top)

  - `mode` — `array` (recommended) or `alloc`
  - `swarm_number` — integer (e.g., 0)
  - `n_trajs_per_swarm` — number of trajectories
  - `n_jobs_per_traj` — subjobs chained per trajectory
  - `gpus_per_replica` — GPUs per trajectory
  - `trajectories_per_node` (alloc mode only) — packing factor per node
  - `account`, `partition`, `time_limit`, optional `array_max_parallel`, `extra_sbatch_flags`
- Launch

```bash
./launch_swarm.sh
```

Watch the job ID(s) and monitor with `squeue -u $USER`.
Most behavior is set in the USER CONFIG block at the top of `launch_swarm.sh`:

- `mode="array"` | `"alloc"`
- `swarm_number=0`
- `n_trajs_per_swarm=4`
- `n_jobs_per_traj=2`
- `gpus_per_replica=1`
- `trajectories_per_node=8` (alloc only; `nodes = ceil(trajs / trajectories_per_node)`)
- `account`, `partition`, `time_limit="HH:MM:SS"`
- `gpus_per_node` (needed for array mode on some clusters to ensure full-node allocation)
- `array_max_parallel=""` (e.g., `8` to throttle a big array)
- `extra_sbatch_flags=( )` (add site-specific flags here)
The wall clock for each submitted job is `time_limit` (see Setting wall time).
`launch_swarm.sh` writes submission logs to `raw_swarms/submission_logs/` (one per `sbatch` call).
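For orientation, a filled-in USER CONFIG block might look like the following; the variable names are the ones listed above, while the values, account, and partition are placeholders for your site:

```bash
# USER CONFIG (illustrative values only)
mode="array"                  # "array" or "alloc"
swarm_number=0
n_trajs_per_swarm=16
n_jobs_per_traj=4             # chained subjobs per trajectory
gpus_per_replica=1
trajectories_per_node=8       # used in alloc mode
gpus_per_node=4               # used in array mode on some clusters
account="my_allocation"       # placeholder
partition="gpu"               # placeholder
time_limit="04:00:00"
array_max_parallel="8"        # yields --array=0-15%8 for 16 trajectories
extra_sbatch_flags=( )
```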
Array mode:

- Submits one sbatch with `--array=0-(N-1)%P` (if `array_max_parallel` is set).
- Each array task runs one trajectory, and `submit_swarm_subjobs.sh` determines the next subjob index for that trajectory, then calls `run_amber.sh` under `srun`.

Pros: simplest scheduling; Slurm handles per-trajectory placement.

Note: This is not an option for Frontier; use alloc mode instead. All other clusters currently supported (scu-login01, cayuga, delta-gpu) support array mode.
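The resulting submission has roughly this shape (a hedged sketch of the flags, not the literal command `launch_swarm.sh` assembles):

```bash
# 16 trajectories, throttled to at most 8 running at once
sbatch --array=0-15%8 \
       -A "$account" -p "$partition" -t "$time_limit" \
       submit_swarm_subjobs.sh
```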
Alloc mode:

- Submits one sbatch per chained subjob, asking for a multi-node allocation sized by `nodes = ceil(n_trajs_per_swarm / trajectories_per_node)` and `tasks_per_node ≈ trajectories_per_node` (bounded by `n_trajs_per_swarm`).
- Inside the allocation, the script launches one `srun` per trajectory concurrently (see the sketch below).
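The sizing arithmetic and the per-subjob chaining can be sketched like this (assumptions: `--parsable` job IDs and an `afterany` dependency; the real script may differ in the details):

```bash
# Allocation size: ceiling division in bash
nodes=$(( (n_trajs_per_swarm + trajectories_per_node - 1) / trajectories_per_node ))

# Chain n_jobs_per_traj subjobs; each waits for the previous one to end
prev=""
for (( j = 0; j < n_jobs_per_traj; j++ )); do
  dep=""
  [[ -n "$prev" ]] && dep="--dependency=afterany:${prev}"
  prev=$(sbatch --parsable $dep -N "$nodes" -t "$time_limit" submit_swarm_subjobs.sh)
done
```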
If sbatch returns `Requested node configuration is not available`, see Troubleshooting.
Within each trajectory directory (`raw_swarms/swarm####/swarm####_traj####/`), the runner:

- Detects the next subjob index by scanning for the last `*subjob####*.mdinfo` file.
  - If none exist yet, it initializes bookkeeping to start at subjob 0000.
- Restart guard: if the previous attempt did not end with `FINISHED` on the last line of `amber_run.log`, it will:
  - decrement to the previous subjob index,
  - write a marker file `subjob_####_restarted`,
  - safely move any files from that subjob (see next section) to `./trash/`,
  - then re-run that subjob cleanly.
- Runs the current subjob via `srun` calling `run_amber.sh`. Amber standard output goes to `amber_log.txt` in the trajectory dir.
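A condensed sketch of that detection/restart logic, following the file-name conventions above (this illustrates the idea; it is not the actual `submit_swarm_subjobs.sh`):

```bash
# Run from inside a trajectory directory.
last=$(ls -1 *subjob????*.mdinfo 2>/dev/null | sort | tail -n 1)
if [[ -z "$last" ]]; then
  subjob=0                                # nothing run yet: start at subjob 0000
else
  idx=${last##*subjob}; idx=${idx:0:4}    # e.g. "0003"
  subjob=$(( 10#$idx + 1 ))               # normally continue with the next subjob
  if [[ "$(tail -n 1 amber_run.log 2>/dev/null)" != "FINISHED" ]]; then
    subjob=$(( 10#$idx ))                 # previous attempt died: redo it
    printf -v tag '%04d' "$subjob"
    touch "subjob_${tag}_restarted"
    mkdir -p trash
    mv -f ./*subjob"${tag}"*.{mdout,mdinfo,rst7,nc} trash/ 2>/dev/null || true
  fi
fi
printf 'Next subjob index: %04d\n' "$subjob"
```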
Each `swarm####_traj####/` directory is expected to contain at least:

- `*.parm7`
- `*.mdin` (your production or staged input)
- a restart/coordinate file whose name matches the subjob naming used by your workflow (e.g., `*subjob0000.rst7` for the very first run).

`run_amber.sh` names outputs with a convention that includes `_subjob####` in the basename, so the subsequent detection of `*.mdinfo` works automatically.
- Submission logs: `raw_swarms/submission_logs/${jobname}_slurm-<jobid>.out`
- Per-trajectory: `amber_log.txt` (stdout of the current `run_amber.sh`), `*.mdout`, `*.mdinfo`, `*.rst7`, `*.nc`, etc.
- Restart markers: `subjob_####_restarted`
- Bookkeeping: `amber_run.log` (last line `FINISHED` indicates a clean end)
When a subjob needs to be re-run, the script moves all files for that subjob index with extensions in:

`mdout`, `mdinfo`, `rst7`, `nc`

…into a flat `./trash/` folder inside the trajectory directory. Subsequent moves overwrite any same-named files already in trash. Nothing is permanently deleted by this script. To reclaim space, manually remove files from trash when you’re ready. The trash folder exists to guard against unforeseen edge cases (i.e., to prevent the scripts from actually deleting files when restarting bad subjobs).
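To reclaim the space across a whole swarm, something like the following works (the path is an example; double-check it before deleting anything):

```bash
# Empty every trash/ folder under swarm0000 without removing the folders themselves
find raw_swarms/swarm0000 -type d -name trash -exec sh -c 'rm -f "$1"/*' _ {} \;
```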
Edit `time_limit="HH:MM:SS"` near the top of `launch_swarm.sh`. That value is passed directly to `sbatch -t` for every job launch (both array and alloc modes).
If sbatch reports `Requested node configuration is not available`, your `-N`, `--ntasks(-per-node)`, `--gpus-per-node`, and partition constraints don’t match what that partition can deliver. Fixes:
- Try the array mode first — it’s the least restrictive.
- For alloc mode, ensure `trajectories_per_node × gpus_per_replica <= GPUs per node` on that partition, and that `nodes = ceil(n_trajs / trajectories_per_node)` is legal for the chosen partition/QOS.
- Some “interactive” partitions disallow multi-node or certain shapes. Use the standard batch GPU partition for queued work.
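It often helps to confirm what the partition actually offers before changing the config (the partition name here is an example):

```bash
sinfo -p gpu -o "%P %D %N %G"   # the GRES column shows GPUs per node
scontrol show partition gpu     # MaxNodes, MaxTime, allowed QOS, etc.
```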
Q: How do I throttle a giant job array?
Set `array_max_parallel=8` (or your limit). You’ll get `--array=0-(N-1)%8`.

Q: How many nodes are requested in alloc mode?
`nodes = ceil(n_trajs_per_swarm / trajectories_per_node)`. Example: 20 trajectories with `trajectories_per_node=8` → 3 nodes.

Q: Where do Amber stdout/stderr go?
To `amber_log.txt` in each trajectory dir. Slurm submission output goes to `raw_swarms/submission_logs/`.

Q: Which Amber engine is used?
Per-cluster in `run_amber.sh`: `pmemd.cuda` on NVIDIA sites; `pmemd.hip` on Frontier (ROCm).
Scripts by Derek M. Shore, PhD.