Copy in #270
base: ec2-new-implementation
Conversation
Left some comments, but a lot are just nitpicks. Talk to me about how you want to proceed. I've also fixed some of these in my own PRs. Also, you mentioned a thread pool, but I'm not sure which one you're referring to?
parser.add_argument("--accessKey", default="", help="AWS account access key content") | ||
parser.add_argument("--instanceType", default="", help="AWS EC2 instance type") | ||
|
||
parser.add_argument("--ec2", action="store_true", help="Enable ec2SSH VMMS") |
can't you just check whether config.VMMS_NAME == "ec2SSH"?
True, I changed it to depend on an earlier argument for vmms
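A minimal sketch of that fix, assuming the earlier argument is named --vmms (the actual name and default are not shown in this diff):

```python
# Sketch only: derive the EC2 flag from an earlier VMMS argument
# instead of a standalone --ec2 switch. The "--vmms" name and its
# default are assumptions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--vmms", default="localDocker", help="VMMS backend to use")
args = parser.parse_args()

requestObj = {"ec2Vmms": args.vmms == "ec2SSH"}
```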
if args.notifyURL:
    requestObj["notifyURL"] = args.notifyURL

if args.callbackURL:
use .get()?
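One possible reading of the suggestion, as a sketch. Note that argparse Namespaces have no .get(), so this goes through vars() first; the loop shape is an assumption:

```python
import argparse

# Hypothetical sketch of the ".get()" suggestion: copy each value into
# requestObj only when it is set, via a dict view of the Namespace.
parser = argparse.ArgumentParser()
parser.add_argument("--notifyURL", default="")
parser.add_argument("--callbackURL", default="")
args = parser.parse_args()

requestObj = {}
for field in ("notifyURL", "callbackURL"):
    value = vars(args).get(field)
    if value:
        requestObj[field] = value
```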
requestObj["disable_network"] = args.disableNetwork | ||
requestObj["instanceType"] = args.instanceType | ||
requestObj["ec2Vmms"] = args.ec2 | ||
requestObj["stopBefore"] = args.stopBefore |
requestObj should be a dataclass
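A hedged sketch of that suggestion; the field names mirror the requestObj keys visible in this diff, and the defaults are assumptions:

```python
from dataclasses import asdict, dataclass

# Sketch of the dataclass suggestion. Field names mirror the
# requestObj keys in this PR; defaults are assumed.
@dataclass
class JobRequest:
    disable_network: bool = False
    instanceType: str = ""
    ec2Vmms: bool = False
    stopBefore: str = ""

req = JobRequest(instanceType="t2.micro", ec2Vmms=True)
payload = asdict(req)  # plain dict for the existing JSON request path
```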
# resetting the free queue using the key doesn't change its content.
# Therefore we empty the queue, thus the free pool, to keep it consistent
# with the total pool.
tango.preallocator.machines.get(key)[1].make_empty()
Wanted to draw attention to the fact that this makes sense only because either both the dictionary and the queue are local (so you can modify them in place), or both are Redis-backed (so modifying the transient copy produced by [1] modifies the Redis object, as they share the same hash).
For example, tango.preallocator.machines.get(key)[0].append(None) does nothing if we are using Redis.
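A toy illustration of the asymmetry, with hypothetical data:

```python
# Local case: .get() returns the stored list itself, so an in-place
# append is visible on the next read.
machines = {"pool": [["vm-1"], None]}
machines.get("pool")[0].append("vm-2")
assert machines["pool"][0] == ["vm-1", "vm-2"]

# Redis-backed case (described, not run): machines.get("pool") would
# deserialize a transient copy, so the same [0].append(...) mutates
# only that copy and the stored value is unchanged. The queue at [1]
# still works because it talks to the same Redis hash directly.
```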
self.lock.acquire()
if vm.name not in self.machines:
    self.machines.set(vm.name, [[], TangoQueue(vm.name)])
    self.machines.get(vm.name)[1].make_empty()
same comment as above regarding remote data structures
stopBefore = ""
if "stopBefore" in jobObj:
    stopBefore = jobObj["stopBefore"]
.get(), dataclass
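The .get() half is a one-liner, assuming jobObj stays a plain dict; the dataclass half would mirror the JobRequest sketch above:

```python
jobObj = {"stopBefore": "waitvm"}  # example payload

# One lookup with a default, replacing the membership test plus indexing.
stopBefore = jobObj.get("stopBefore", "")
```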
self.log.debug("Error in notifyServer: %s" % str(e)) | ||
|
||
def afterJobExecution(self, hdrfile, msg, returnVM): | ||
def afterJobExecution(self, hdrfile, msg, returnVM, killVM=True): |
I would prefer detachVM over killVM, as "killVM" can still mean returning the VM to the free list
detachVM: "The worker must always call this function before returning" which will not be the case if using stopBefore.
consider setting vm.keep_for_debugging=True and still call detachVM?
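A sketch of that alternative as it might sit in the worker; vm.keep_for_debugging is the attribute proposed above, and detachVM honoring it is an assumption:

```python
# Sketch: mark the VM rather than skipping detachVM, so the
# "always call detachVM before returning" invariant still holds.
# detachVM is then assumed to skip destroying a VM marked this way.
if self.job.stopBefore:
    vm.keep_for_debugging = True
self.detachVM(return_vm=returnVM)
```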
# Thread exit after termination
- self.detachVM(return_vm=returnVM)
+ if killVM:
Add a comment that explains why you wouldn't want to killVM (to SSH into the autograding VM to debug it)
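One way the requested comment could read, sketched over the new hunk:

```python
if killVM:
    # killVM=False is used when a job stops early for debugging:
    # keeping the VM alive lets you SSH into the autograding VM
    # and inspect its state.
    self.detachVM(return_vm=returnVM)
```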
time.sleep(config.Config.TIMER_POLL_INTERVAL)

- def copyIn(self, vm, inputFiles):
+ def copyIn(self, vm, inputFiles, job_id=None):
Add a comment that job_id is only for debugging currently
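For example, as a sketch with the body elided:

```python
def copyIn(self, vm, inputFiles, job_id=None):
    # job_id is currently only used for debugging (e.g. tagging logs
    # or copied-in artifacts with the job); it does not change the
    # copy behavior itself.
    ...
```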
self.log.debug("Waiting for VM") | ||
if self.job.stopBefore == "waitvm": | ||
msg = "Execution stopped before %s" % self.job.stopBefore | ||
returnVM = True |
why set returnVM to True when killVM=False makes the returnVM argument meaningless?
Overall looks good, left some small comments
def _clean(self):
    self.__db.delete(self.key)

def make_empty(self):
can we not do self.__db.delete(self.key) here instead? How is this function different from _clean? I'm not super familiar with Redis, but it seems wrong that we have to iterate here.
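The question as a sketch, assuming every queued item lives under the single Redis key self.key (connection details assumed):

```python
import redis

class TangoQueue:
    def __init__(self, name, db=None):
        self.key = name
        self.__db = db or redis.Redis()  # connection details assumed

    def make_empty(self):
        # DEL empties the queue in one round trip; an absent Redis key
        # reads back as an empty list. That is also exactly what _clean
        # does above, hence the "how is this different?" question.
        self.__db.delete(self.key)
```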
tango.preallocator.machines.get(key)[1].make_empty()
jobs = JobManager(tango.jobQueue)

print("Starting the stand-alone Tango JobManager")
nit: log statement
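The nit, addressed as a sketch (the logger name and setup are assumptions):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("JobManager")

# Route the startup message through the logger instead of print.
log.info("Starting the stand-alone Tango JobManager")
```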
Attempts to fix the copy-in issue that we're still facing even after retries.
Ran a thousand jobs on https://dev.autolabproject.com/courses/test-course/jobs?id=500 without failure.