This is a watchme repository that shows how easy it is to monitor a task at some frequency using the watchme monitor pid task provided by the psutils set of tasks. Specifically, we are going to:
- Start with this sklearn mnist example
- Build it into a container (see the Dockerfile here), served at vanessa/watchme-mnist
- Run the container on an HPC cluster with varying amounts of memory, for a training task that takes approximately 20 minutes.
And compare results!
This is a fairly simple analysis in that I could install watchme and then write a few quick scripts, run, and be done!
- run_job.sh will submit job.sh to the cluster, specifying input parameters and outputs
- job.sh is submitted to different nodes with varying memory, 5 times for each memory setting
- data is where output is written, including JSON result files and images from the training.
Specifically, to install watchme:
$ pip install watchme[all]
You can also clone and install from the master branch directly:
$ git clone https://www.github.com/vsoch/watchme
$ cd watchme
$ pip install .[all] --user
And then I created a watcher folder (this repo).
$ watchme create watchme-mnist
We aren't going to be using .git as a temporal database, but it's still handy to use watchme to create the repo for us :)
The script job.sh is submitted via run_job.sh. It first exports some variables to the environment so that they are added to our data:
# Add variables for host, cpu, etc.
export WATCHMEENV_HOSTNAME=$(hostname)
export WATCHMEENV_NPROC=$(nproc)
export WATCHMEENV_MAXMEMORY=${mem}
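The `${mem}` value above has to come from somewhere. Here is a minimal sketch of how job.sh might read it, assuming the four arguments passed by the sbatch call in run_job.sh arrive as positional parameters in the order mem, iter, name, output. The helper `read_job_args` and the sample values are hypothetical and are not taken from the actual script.

```shell
# Sketch of reading the positional arguments inside job.sh (assumed order,
# matching the sbatch call: mem, iter, name, output).
# read_job_args is a hypothetical helper used only for illustration.
read_job_args() {
    mem="$1"
    iter="$2"
    name="$3"
    output="$4"
}

# Sample values (made up) standing in for what sbatch would pass
read_job_args 8 2 mnist data/mnist-iter2-8gb

# With mem set, the export from job.sh has a value to record
export WATCHMEENV_MAXMEMORY="${mem}"
echo "${name}-${iter} -> ${output}.json (max memory ${mem} GB)"
```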
The command to use watchme looks like this. We run the model and take a measurement every 20 seconds. The output is redirected into a JSON file, and the script is given the name of a PNG file (in the same directory) to save a plot to. This should take 20-30 minutes.
watchme monitor --name $name-$iter --seconds 20 singularity run docker://vanessa/watchme-mnist ${output}.png > ${output}.json
The job is submitted in a simple loop in run_job.sh. Notice how iter and mem are defined by the loops:
for iter in 1 2 3 4 5; do
    for mem in 4 6 8 12 16 18 24 32 64 128; do
        output="${outdir}/${name}-iter${iter}-${mem}gb"
        echo "sbatch --mem=${mem}GB job.sh ${mem} ${iter} ${name} ${output}"
        sbatch --mem=${mem}GB job.sh "${mem}" "${iter}" "${name}" "${output}"
    done
done
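As a quick sanity check on how many jobs this loop submits, the sketch below prints the output prefix for each combination instead of calling sbatch. The values of `name` and `outdir` are placeholders made up for illustration (the real ones are set in run_job.sh), and `list_jobs` is a hypothetical helper.

```shell
# Dry run of the submission loop: prints one output prefix per job, no sbatch.
# "name" and "outdir" are placeholder values (assumptions), not from run_job.sh.
name="mnist"
outdir="data"

list_jobs() {
    local iter mem
    for iter in 1 2 3 4 5; do
        for mem in 4 6 8 12 16 18 24 32 64 128; do
            echo "${outdir}/${name}-iter${iter}-${mem}gb"
        done
    done
}

# 5 iterations x 10 memory settings: count the jobs that would be submitted
list_jobs | wc -l
```

Running a dry run like this before the real submission is a cheap way to confirm the file names and the job count without using any cluster allocation.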
The results were each written directly to files in data (not using git as a temporal database).
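Because the iteration and memory setting are encoded in each file name, they can be recovered later when comparing runs. The following is a minimal sketch that assumes the naming pattern from run_job.sh (name-iterN-MEMgb) with a .json extension; `parse_result_name` is a hypothetical helper and not part of watchme.

```shell
# Recover iteration and memory from a result file name such as
# data/mnist-iter3-16gb.json (naming pattern taken from run_job.sh).
# parse_result_name is a hypothetical helper, not part of watchme.
parse_result_name() {
    local base tail iter mem
    base=$(basename "$1" .json)
    # Drop everything up to and including the last "-iter", leaving e.g. "3-16gb"
    tail="${base##*-iter}"
    iter="${tail%%-*}"
    mem="${tail##*-}"
    mem="${mem%gb}"
    echo "${iter} ${mem}"
}

# Print "iteration memory" for every JSON result in the data folder
for result in data/*.json; do
    [ -e "${result}" ] || continue   # skip if the glob matched nothing
    parse_result_name "${result}"
done
```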