Skip to content

Conversation

KesterTan
Copy link

@KesterTan KesterTan commented Jan 26, 2025

Attempts to fix the copy-in issue that we're still facing even after retries.

  1. Captures invalid inputfiles structure
  2. Adds -p flag and permissions when creating autolab directory. This ensures that the specified directory and any necessary parent directories are created. If the directory already exists, the command does not raise an error, it would originally throw an error.
  3. Added thread pool to limit scp commands so that instance is not overloaded
  4. Added log statements with job id for more in-depth logging.

Ran a thousand jobs on https://dev.autolabproject.com/courses/test-course/jobs?id=500 without failure.

@KesterTan KesterTan marked this pull request as ready for review February 2, 2025 17:04
@KesterTan KesterTan requested a review from evanyeyeye February 2, 2025 17:04
@KesterTan KesterTan removed the request for review from evanyeyeye September 16, 2025 16:48
@KesterTan KesterTan requested review from a team, 20wildmanj, anthony-yip and dwang3851 and removed request for a team and 20wildmanj September 16, 2025 19:34
Copy link

@anthony-yip anthony-yip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, but a lot are just nitpicks. Talk to me how you want to proceed. I've also fixed some of these in my own PRs. Also, you mentioned thread pool but I'm not sure which one you're referring to?

parser.add_argument("--accessKey", default="", help="AWS account access key content")
parser.add_argument("--instanceType", default="", help="AWS EC2 instance type")

parser.add_argument("--ec2", action="store_true", help="Enable ec2SSH VMMS")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you just check for whether config.VMMS_NAME == "ec2SSH"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I changed it to depend on an earlier argument for vmms

if args.notifyURL:
requestObj["notifyURL"] = args.notifyURL

if args.callbackURL:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .get()?

requestObj["disable_network"] = args.disableNetwork
requestObj["instanceType"] = args.instanceType
requestObj["ec2Vmms"] = args.ec2
requestObj["stopBefore"] = args.stopBefore

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestObj should be a dataclass

# resetting the free queue using the key doesn't change its content.
# Therefore we empty the queue, thus the free pool, to keep it consistent
# with the total pool.
tango.preallocator.machines.get(key)[1].make_empty()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to draw attention that this makes sense only because either both the dictionary and the queue are local (so you can modify in-place), or are both Redis (so modifying a transient copy generated by [1] modifies the redis object as they share the same hash).

For example, tango.preallocator.machines.get(key)[0].append(None) does nothing if we are using Redis.

self.lock.acquire()
if vm.name not in self.machines:
self.machines.set(vm.name, [[], TangoQueue(vm.name)])
self.machines.get(vm.name)[1].make_empty()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above regarding remote data structures


stopBefore = ""
if "stopBefore" in jobObj:
stopBefore = jobObj["stopBefore"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.get(), dataclass

self.log.debug("Error in notifyServer: %s" % str(e))

def afterJobExecution(self, hdrfile, msg, returnVM):
def afterJobExecution(self, hdrfile, msg, returnVM, killVM=True):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer detachVM over killVM, as "killVM" can still mean returning the VM to the free list

Copy link

@anthony-yip anthony-yip Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detachVM: "The worker must always call this function before returning" which will not be the case if using stopBefore.

consider setting vm.keep_for_debugging=True and still call detachVM?


# Thread exit after termination
self.detachVM(return_vm=returnVM)
if killVM:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment that explains why you wouldn't want to killVM (to SSH into the autograding VM to debug it)

time.sleep(config.Config.TIMER_POLL_INTERVAL)

def copyIn(self, vm, inputFiles):
def copyIn(self, vm, inputFiles, job_id=None):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment that job_id is only for debugging currently

self.log.debug("Waiting for VM")
if self.job.stopBefore == "waitvm":
msg = "Execution stopped before %s" % self.job.stopBefore
returnVM = True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why set returnVM to True when the killVM=False makes the returnVM argument meaningless?

Copy link

@dwang3851 dwang3851 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, left some small comments

def _clean(self):
self.__db.delete(self.key)

def make_empty(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not do self.__db.delete(self.key) here instead? How is this function different from _clean? I'm not super familiar with Redis but it seems wrong that we have to iterate here

tango.preallocator.machines.get(key)[1].make_empty()
jobs = JobManager(tango.jobQueue)

print("Starting the stand-alone Tango JobManager")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: log statement

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants