Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author.
This tutorial starts with the same setup as our Wordcount Tutorial for Submitting Multiple Jobs, but focuses on how to organize that example more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future.
We can analyze one book by running the wordcount.py script with the name of the book we want to analyze:
$ ./wordcount.py Alice_in_Wonderland.txt
Try running the command to see what the script outputs. Once you have done that, delete the output file it created (rm counts.Alice_in_Wonderland.txt).
We want to run this script on all the books we have copies of.
- What is the input set for this HTC workload?
- What is the output set?
Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point?
There will also be system and HTCondor files produced when we submit a job -- how would you organize the log, standard error, and standard output files?
Try making those changes before moving on to the next section of the tutorial.
There are many different ways to organize files; a simple example that works for most workloads is having a directory for your input files and a directory for your output files. We can set up this structure on the command line by running:
$ mkdir input
$ mv *.txt input/
$ mkdir output/
We can view our current directory and its subdirectories by using the recursive flag with the ls command:
$ ls -R
README.md books.submit input output wordcount.py
./input:
Alice_in_Wonderland.txt Huckleberry_Finn.txt Ulysses.txt
Dracula.txt Pride_and_Prejudice.txt
./output:
We are also going to create two more directories: logs/ for the HTCondor log files and errout/ for the standard error and standard output files:
$ mkdir logs
$ mkdir errout
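The directory setup above can also be written as a short, re-runnable sketch. It assumes it is run from the top-level tutorial directory and that the book files end in .txt, as in the listing above; the directory names are the ones this tutorial uses.

```shell
# Sketch: build the whole directory layout in one step.
# -p makes mkdir succeed even if a directory already exists,
# so this is safe to run more than once.
mkdir -p input output logs errout

# Move any book files still sitting in the top-level directory.
for f in *.txt; do
    [ -e "$f" ] || continue   # nothing matched: skip the literal pattern
    mv "$f" input/
done
```

Running ls -R afterwards should show the four directories, with the book files under input/.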
Now we want to submit a test job that uses this organizing scheme, using just one item in our input set -- in this example, we'll use the Alice_in_Wonderland.txt file from our input/ directory. The lines that need to be filled in are shown below and can be edited using the nano text editor:
$ nano books.submit
executable = wordcount.py
arguments = Alice_in_Wonderland.txt
transfer_input_files = input/Alice_in_Wonderland.txt
transfer_output_files = counts.Alice_in_Wonderland.txt
transfer_output_remaps = "counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt"
Note that to tell HTCondor the location of the input file, we need to include the input directory. We're also using a submit file option called transfer_output_remaps that will essentially move the output file to our output/ directory by renaming or remapping it.
We also want to edit the submit file lines that tell HTCondor where to put the log, standard error, and standard output files:
$ nano books.submit
output = errout/job.$(ClusterID).$(ProcID).out
error = errout/job.$(ClusterID).$(ProcID).err
log = logs/job.$(ClusterID).$(ProcID).log
Once you've made the above changes to the books.submit file, you can submit it and monitor its progress:
$ condor_submit books.submit
$ condor_watch_q
(Type CTRL-C to stop the condor_watch_q command.)
We are now sufficiently organized to submit our whole workload.
First, we need to create a file with our input set -- in this case, it will be a list of the book files we want to analyze. We can do this by using the shell's listing command ls and redirecting the output to a file:
$ ls input > booklist.txt
$ cat booklist.txt
Then, we modify our submit file to reference this input list and replace the static values from our test job (Alice_in_Wonderland.txt) with a variable -- we've chosen $(book) below:
$ nano books.submit
executable = wordcount.py
arguments = $(book)
transfer_input_files = input/$(book)
transfer_output_files = counts.$(book)
transfer_output_remaps = "counts.$(book)=output/counts.$(book)"
# other options
queue book from booklist.txt
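Before submitting, it can help to preview how $(book) will be filled in for each job. The shell function below is only an illustrative sketch: preview_jobs is a hypothetical helper, not an HTCondor command, and it assumes booklist.txt holds one file name per line. It prints the values that the submit file above would produce for each line of the list.

```shell
# Sketch: print the per-job values implied by "queue book from booklist.txt".
# preview_jobs is a hypothetical helper; it only reads the list and prints text.
preview_jobs() {
    # Read the list given as the first argument, or booklist.txt by default.
    while IFS= read -r book; do
        [ -n "$book" ] || continue   # skip blank lines
        echo "arguments=$book input=input/$book output=output/counts.$book"
    done < "${1:-booklist.txt}"
}
```

Calling preview_jobs from the top-level directory prints one line per book, for example `arguments=Dracula.txt input=input/Dracula.txt output=output/counts.Dracula.txt`. A name that should not be there, such as the list file itself, is easy to spot at this point.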
Once this is done, you can submit the jobs as usual:
$ condor_submit books.submit
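Once the jobs have finished, one way to confirm that the workload is complete is to compare the output/ directory against the input list. The function below is a sketch under the assumptions of this tutorial: check_outputs is a hypothetical helper, and it expects each job to return a file named counts.<book> into output/, as set up by the remap in the submit file.

```shell
# Sketch: report books from booklist.txt that have no counts file in output/.
# check_outputs is a hypothetical helper, not part of HTCondor.
check_outputs() {
    missing=0
    while IFS= read -r book; do
        [ -n "$book" ] || continue                 # skip blank lines
        if [ ! -s "output/counts.$book" ]; then    # -s: file exists and is not empty
            echo "no output yet: $book"
            missing=$((missing + 1))
        fi
    done < booklist.txt
    echo "$missing book(s) without output"
    [ "$missing" -eq 0 ]                           # succeed only if nothing is missing
}
```

Run check_outputs from the top-level directory. For any book it reports, the matching files in errout/ and logs/ are the first place to look.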