How to run long simulations (>24 hours) with qsub on gemini

Victor Onink edited this page Jul 29, 2020 · 1 revision

When I tried to run long simulations (>24 hours) using qsub, my jobs would often sit in the queue for at least 1.5 days (at which point I just deleted the job). The reason is that, by default, qsub submits the job to the all.q queue, which seems to accept only jobs of up to 24 hours (and not a second more...). Instead, you want to submit to the long.q queue, which is fine with jobs that run longer than 24 hours.

The problem with long.q is that it is only available on the science-bs36 node, not on science-bs35 (the node you land on by default when you log into gemini). Therefore, if you want to run on long.q@science-bs36 and all your data is on scratch, you need to make sure that a copy of all the relevant data is in the science-bs36 scratch directory. This wiki page explains how to do that!

How to submit a job specifically in the long.q@science-bs36 queue

I am assuming here that you have already logged in to gemini. When you submit your job using qsub, make sure to include the option -q long.q@science-bs36. It is also important to specify exactly how much time you want for your run. Say I want 40 hours for my run; I would then also need to include -l h_rt=40:00:00. If your job finishes before that time, everything is fine. If your job needs more time and isn't done yet, it will still be terminated after 40 hours regardless (so it's good to have a rough estimate of how long your run will take, so that you can request enough time).
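Put together, the two options described above give a submission command along the following lines. This is only a sketch: the job script name run_simulation.sh is a made-up placeholder, and the 40 hours is just the example value from the text. The snippet only prints the command so you can check it first; uncomment the last line to actually submit.

```shell
#!/bin/bash
# Hypothetical job script name (an assumption); replace with your own script.
JOB_SCRIPT="run_simulation.sh"

# Requested wall-clock time in hours (40 is the example used in the text).
HOURS=40
H_RT=$(printf '%d:00:00' "$HOURS")

# Build the submission command with the queue and run-time options.
QSUB_CMD="qsub -q long.q@science-bs36 -l h_rt=${H_RT} ${JOB_SCRIPT}"

# Print the command for inspection (dry run).
echo "$QSUB_CMD"

# Uncomment the next line to actually submit the job.
# $QSUB_CMD
```

If your run is expected to take longer, change HOURS accordingly, since the job is stopped once the requested time is up.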

How to run files in the long.q queue if you save files on scratch

Now, say that you have followed the steps described above and have submitted your job to the long.q queue. If your code makes use of the scratch directory in any way, you might now run into errors. The reason is that scratch is a local directory: there are separate scratch directories on the science-bs35 and science-bs36 nodes. Therefore, if you have everything saved in the scratch directory on the science-bs35 node and you try to find those files on the science-bs36 node, you will be disappointed, and your code will return an error saying that the files cannot be found. So make sure that you copy the relevant files to the science-bs36 scratch directory before submitting your job. Now you may ask, how would I do this? I am sure that there are many different ways (google is your friend), but this approach worked for me:

  1. Once you have logged into gemini, switch to the science-bs36 node by typing ssh science-bs36
  2. Go to the scratch directory with cd /scratch/whatever_directory_you_want_to_go_to
  3. To copy an entire directory from the other scratch directory (the directory that contains the files your job needs to run smoothly), enter the command scp -r [username]@gemini.science.uu.nl:/scratch/directory_you_want_to_copy . (Note that the dot at the end is part of the command, not the end of a sentence. Do not forget it, otherwise the command won't work!)
  4. Have a sufficient amount of patience, depending on how many files you are copying (for example, copying all globcurrent files for 12 years, about 140 GB, takes a while)

Now that you have all the files in the science-bs36 scratch directory, you should be fine to submit jobs in the long.q queue.
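The copy steps above can be sketched as a few lines of shell, meant to be run on science-bs36 (so after ssh science-bs36 from gemini). All three values at the top are placeholders taken from the steps, not real names, so fill in your own. As written, the snippet only prints the commands so you can check them before running anything.

```shell
#!/bin/bash
# Placeholders (assumptions): replace with your own username and directories.
USERNAME="your_username"
SRC_DIR="/scratch/directory_you_want_to_copy"              # on science-bs35
DEST_DIR="/scratch/whatever_directory_you_want_to_go_to"   # on science-bs36

# Recursive copy from the default gemini node into the current directory.
# The trailing dot is the destination (the current directory).
COPY_CMD="scp -r ${USERNAME}@gemini.science.uu.nl:${SRC_DIR} ."

# Print what would be run (dry run).
echo "cd ${DEST_DIR}"
echo "$COPY_CMD"

# Uncomment to actually change directory and copy.
# cd "$DEST_DIR" && $COPY_CMD
```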

Output in the science-bs36 scratch folder?

You may have set up your code so that the output file is saved in the scratch directory. If you submit your job on the science-bs36 node, the output will end up in the science-bs36 scratch directory. If you want it in your home directory, you can simply copy the file there. However, if you are running a long simulation with a lot of output, the output file will likely be quite large and might not fit in the home directory. In that case, you will have to either keep the file in the scratch directory or download it to your laptop. I have not managed to download a file directly from the science-bs36 scratch directory, since that directory cannot be found. What does work is to first copy the output file to the science-bs35 scratch directory (scp science-bs36:/scratch/path_to_file_here .) and then download it from there.
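As a small sketch of that last step: the command below is meant to be run on science-bs35, from inside the scratch directory where you want the output file to end up. The file path is the same placeholder as in the text, not a real path. The snippet only prints the command; uncomment the last line to run it.

```shell
#!/bin/bash
# Placeholder path (an assumption): replace with the location of your output file
# in the science-bs36 scratch directory.
REMOTE_FILE="/scratch/path_to_file_here"

# Copy the file from science-bs36 into the current directory on science-bs35.
FETCH_CMD="scp science-bs36:${REMOTE_FILE} ."

# Print the command for inspection (dry run).
echo "$FETCH_CMD"

# Uncomment to actually copy the file.
# $FETCH_CMD
```

Once the file is in the science-bs35 scratch directory, you can download it to your laptop the way you normally would.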