Add new user-focused job submission and management tutorial #6


Merged: 6 commits merged into master from new-user-tutorial on Jan 28, 2020

Conversation

JoshKarpel
Contributor

@JoshKarpel JoshKarpel commented Oct 11, 2019

This PR adds a new job submission and management tutorial that is designed for new users, rather than administrators who are already proficient with HTCondor.

@JoshKarpel JoshKarpel added the enhancement New feature or request label Oct 11, 2019
@JoshKarpel JoshKarpel requested a review from bbockelm October 11, 2019 17:50
@JoshKarpel JoshKarpel self-assigned this Oct 11, 2019
@JoshKarpel JoshKarpel changed the base branch from stable to master October 22, 2019 17:02
@ChristinaLK

let me know if I should reformat my email comments as line comments on this PR. ;)

@JoshKarpel
Contributor Author

JoshKarpel commented Nov 25, 2019

Christina's email comments copied below :)


Setup and start were super easy, although a) I already had Docker installed and b) only the last link (running on 127.0.0.1) worked.

Some thoughts:

  • it was NOT obvious to me (although maybe that's because I was in a Docker container and not logging into a submit server) that I was submitting jobs from the folder containing the notebook, and that's where the output files are. I think it would be useful to say something like "look in the file browser and you can see the files produced by the job!" You could have people check the .out file of the hostname job to confirm it matches what Python is telling you (yes, Python is reading the same file, but I still think it's useful to see how to access the same info in two ways).
  • Similarly, I think having people open up a shell and run "condor_q" during the sleep jobs would be a great way of showing that this is doing something they understand, just with a different syntax (I'm assuming here, that this is for people who have already used the command line tools.)
  • Transactions are confusing. I would either explain more or add a caveat that there's more going on there but that it's not important to understand.
  • Using "sub" as the name of the submit description all three times is not clear to me. "sub.queue()" in particular doesn't distinguish the variable (sub, defined by you) from the method (defined by the htcondor module). I would probably do something like "hostjob", "sleepjob", etc. to distinguish the different submit descriptions. And I know I agreed with you that the submit description is different than the job, but in this case I think the naming of "job" is better than "sub". /shrug
  • This syntax is a bit awkward, but I don't think there's anything you can do about it?
    constraint='ClusterId=={}'.format(cluster_id), instead of something simpler like constraint='ClusterId==cluster_id'
  • Maybe explain what happens if you don't use "attr" with the query method? And provide a list of common attributes for people to experiment with.
  • The 1 in "queue_with_itemdata()" is confusing. Can you just leave it out and do "queue_with_itemdata(txn, itemdata=iter(itemdata))", or whatever the argument name should be?
  • the random files in the "inputs" folder failed because the folder didn't exist and the Python code didn't create it. I created it in the JupyterLab interface and then everything went smoothly.

In general: it would be awesome to have more exercises or even suggestions for things to modify and try as you go through. Otherwise you just click through. It could be simple things --

  • modify the submit description and submit again.
  • Try different attrs for HTCondor queue.
  • How would you submit a job that printed a message using echo? etc.
  • Having the schedd.query() block after every job description would be valuable (I think), or, set it up and have people finish and then run it.

Unrelated: somewhere, make clear that when you close the browser/container, you lose all the progress you made in that notebook?

@JoshKarpel
Contributor Author

@ChristinaLK

it was NOT obvious to me (although maybe that's because I was in a Docker container and not logging into a submit server) that I was submitting jobs from the folder containing the notebook, and that's where the output files are. I think it would be useful to say something like "look in the file browser and you can see the files produced by the job!" You could have people check the .out file of the hostname job to confirm it matches what Python is telling you (yes, Python is reading the same file, but I still think it's useful to see how to access the same info in two ways).

I added some language in the "Submitting a Simple Job" section about this.

Similarly, I think having people open up a shell and run "condor_q" during the sleep jobs would be a great way of showing that this is doing something they understand, just with a different syntax (I'm assuming here, that this is for people who have already used the command line tools.)

At least in my mind, the goal is to present an isolated picture. You or @bbockelm might want to override me on that though :)

Transactions are confusing. I would either explain more or add a caveat that there's more going on there but that it's not important to understand.

I definitely don't think we should explain more in this tutorial, so I noted that it's just boilerplate and that it isn't important to understand the mechanics of it.

Using "sub" as the name of the submit description all three times is not clear to me. "sub.queue()" in particular doesn't distinguish the variable (sub, defined by you) from the method (defined by the htcondor module). I would probably do something like "hostjob", "sleepjob", etc. to distinguish the different submit descriptions. And I know I agreed with you that the submit description is different than the job, but in this case I think the naming of "job" is better than "sub". /shrug

I like this, and changed all the names to *_job. I'm perpetually torn on whether to call them job descriptions, because I don't know that users care. For now, I think this is fine.

This syntax is a bit awkward but I don't think there's anything you can do about it: constraint='ClusterId=={}'.format(cluster_id), instead of something simpler like constraint='ClusterId==cluster_id'

If we were on Python 3.6+ we could do f'ClusterId == {cluster_id}', but we can't guarantee that everyone who wants to use the tutorial is on a modern Python. The .format() method works on 2.7, so I think we need to use it.
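For reference, a minimal sketch of the two spellings; the cluster_id value here is just a stand-in, not taken from the tutorial:

```python
# str.format() works on Python 2.7 as well as Python 3
cluster_id = 23
constraint = "ClusterId=={}".format(cluster_id)
print(constraint)  # ClusterId==23

# on Python 3.6+ an f-string reads more naturally:
# constraint = f"ClusterId=={cluster_id}"
```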

Maybe explain what happens if you don't use "attr" with the query method? And provide a list of common attributes for people to experiment with.

See end.

The 1 in "queue_with_itemdata()" is confusing. Can you just leave it out and do "queue_with_itemdata(txn, itemdata=iter(itemdata))", or whatever the argument name should be?

Yes, we can! There was some kerfuffle with the ticket causing regressions, but I had forgotten that this part did make it in (https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7064).
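As a pure-Python sketch of what the itemdata might look like: each element is a dict mapping submit-macro names to per-job values. The names in_file, host_job, and txn are illustrative, not taken from the tutorial:

```python
# one dict of submit-macro values per job
itemdata = [{"in_file": "inputs/file-{}.txt".format(i)} for i in range(3)]

# with the count argument dropped, the call becomes (hypothetically):
#   host_job.queue_with_itemdata(txn, itemdata=iter(itemdata))
print(itemdata[0])  # {'in_file': 'inputs/file-0.txt'}
```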

the random files in the "inputs" folder failed because the folder didn't exist and the python didn't create it. I created it in the jupyter lab interface and then everything went smoothly.

This was a plain bug; I needed to tell shutil.rmtree to ignore errors. Should work as expected now without needing to make the directory yourself.
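The fix amounts to the pattern below; the path is created under a temp directory here so the sketch is safe to run anywhere, whereas the tutorial presumably uses a relative "inputs" path:

```python
import os
import shutil
import tempfile

# stand-in for the tutorial's inputs/ folder
inputs = os.path.join(tempfile.mkdtemp(), "inputs")

# ignore_errors=True keeps rmtree from raising on the first run,
# when the directory does not exist yet
shutil.rmtree(inputs, ignore_errors=True)
os.makedirs(inputs)
```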


In general: it would be awesome to have more exercises or even suggestions for things to modify and try as you go through. Otherwise you just click through. It could be simple things --

  • modify the submit description and submit again.
  • Try different attrs for HTCondor queue.
  • How would you submit a job that printed a message using echo? etc.
  • Having the schedd.query() block after every job description would be valuable (I think), or, set it up and have people finish and then run it.

I like this idea, but how about we do it as an "exercises" section at the bottom? We could have a few exercises with expected output embedded in markdown cells, and even provide some input files to work with. Things like

  • Write a submit description that submits jobs using these inputs files in a certain way
  • Write a query that gets this information from some jobs and does something with it
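For the echo idea above, a hypothetical submit description (file names and arguments are illustrative, not from the tutorial) might look like:

```
executable = /bin/echo
arguments = "hello from cluster $(ClusterId) proc $(ProcId)"
output = echo-$(ProcId).out
error = echo-$(ProcId).err
log = echo.log

queue 3
```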

Thoughts?

@ChristinaLK

I like the idea of an exercises section at the bottom!

Will leave more comments/ideas when I have time.

@JoshKarpel
Contributor Author

I added three exercises based on some of the ideas above. I didn't think too hard about their difficulty, but I think the second one is probably too hard for this tutorial without knowing a lot more about the ClassAd language.

@ChristinaLK

I'll re-add this to my queue of things to review.

@JoshKarpel
Contributor Author

@ChristinaLK I'm gonna go ahead and merge this. Let's keep this in mind and review later though!

@JoshKarpel JoshKarpel merged commit 71cf611 into master Jan 28, 2020
@JoshKarpel JoshKarpel deleted the new-user-tutorial branch February 21, 2020 16:39