Add "trainremote" command #906

Open: wants to merge 1 commit into base: main

BillyCheung10botics
Contributor

This commit adds a new command, "trainremote", that lets users train their AI model on our AWS server. Currently, training is hardcoded to a KerasLinear model with a maximum of 30 epochs. Use the command as in the following example:

donkey trainremote --tub ~/mycar/data/tub_4_21-06-11/ ~/mycar/data/tub_3_21-06-11/ --path ~/mycar --model modelname \
  --url https://hq.robocarstore.com/train/submit_job --get https://hq.robocarstore.com/train/refresh_job_statuses

The arguments --tub, --path and --model are mandatory; --url and --get are optional:

  • --tub: the path(s) of your data tub(s)

  • --path: the path of your car directory (e.g. ~/mycar)

  • --model: the file name (without extension) of the resulting model, loss function plot and movie

  • --url: the URL to post the training request to

  • --get: the URL to get the status of the submitted job from
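The argument layout described above could be expressed with `argparse` roughly as follows. This is a minimal sketch, not the code from this PR: the function name `build_parser` is hypothetical, and treating the two URLs from the example command as defaults is an assumption.

```python
import argparse

# URLs copied from the example command; using them as defaults is an assumption.
DEFAULT_SUBMIT_URL = "https://hq.robocarstore.com/train/submit_job"
DEFAULT_STATUS_URL = "https://hq.robocarstore.com/train/refresh_job_statuses"


def build_parser():
    """Hypothetical parser mirroring the options listed in the description."""
    parser = argparse.ArgumentParser(prog="trainremote")
    # Mandatory arguments, as stated in the description.
    parser.add_argument("--tub", nargs="+", required=True,
                        help="path(s) of the data tub(s)")
    parser.add_argument("--path", required=True,
                        help="path of the car directory")
    parser.add_argument("--model", required=True,
                        help="file name (without extension) for model, plot and movie")
    # Optional arguments.
    parser.add_argument("--url", default=DEFAULT_SUBMIT_URL,
                        help="URL to post the training request to")
    parser.add_argument("--get", default=DEFAULT_STATUS_URL,
                        help="URL to get the job status from")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--tub", "tub_a", "tub_b", "--path", "mycar", "--model", "modelname"]
)
print(args.tub, args.path, args.model)
```

`nargs="+"` is what lets `--tub` accept one or more tub paths, matching the example command with two tubs.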

After a job is submitted via "trainremote", training takes place on our AWS server with a maximum duration of 15 min. You can view your submitted job (and others' jobs) on https://hq.robocarstore.com/train/. Training usually takes around 10 min to complete. A model file, a plot and a movie are then downloaded from our server and saved to the 'models' and 'movies' folders.

@DocGarbanzo
Contributor

@BillyCheung10botics - this is a great addition. I have three comments:

  1. Can you make --path optional, as we have in the train and other commands?
  2. Can you please convert the print statements to logging?
  3. The checks are failing. Maybe the package is not available in our conda channels; in that case it should go under pip. Can you check that, please?
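For the first point, a minimal sketch of an optional `--path`, assuming the current directory is an acceptable fallback (the actual default used by the other commands is not shown in this thread):

```python
import argparse

parser = argparse.ArgumentParser(prog="trainremote")
# Assumption: fall back to the current directory when --path is omitted.
parser.add_argument("--path", default=".", help="path of the car directory")
parser.add_argument("--tub", nargs="+", required=True, help="path(s) of the data tub(s)")
parser.add_argument("--model", required=True, help="model file name without extension")

# --path is left out on purpose to show the default being applied.
args = parser.parse_args(["--tub", "tub_a", "--model", "modelname"])
print(args.path)
```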

@DocGarbanzo
Contributor

@BillyCheung10botics - can you have a look at addressing the installation issues from the checks above, and a couple of minor comments? It would be great to have that change in Donkey Car.

'donkeycar_version': str(__version__)
}
)
print(f"URL submitted to: {submit_job_url}")
Contributor

Please switch to logging statements instead of print; that's what we are using now.
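A minimal sketch of what that switch could look like for the two submission messages in the diff. The helper name `report_submission` is hypothetical, and the module-level `logger` is an assumption about how logging is set up in the target module.

```python
import logging

# Assumption: a module-level logger named after the module.
logger = logging.getLogger(__name__)


def report_submission(submit_job_url, status_code):
    """Hypothetical helper: log the two messages the diff currently prints."""
    # %-style arguments defer string formatting until the record is emitted.
    logger.info("URL submitted to: %s", submit_job_url)
    logger.info("Submission response: HTTP %s", status_code)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_submission("https://hq.robocarstore.com/train/submit_job", 200)
```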

# The MultipartEncoder provides the content-type header with the boundary:
headers={'Content-Type': mp_encoder.content_type}
)
print(f"Submission response: HTTP {r.status_code}")
Contributor

here too

if ("job_uuid" in r.json()):
    try:
        uuid = r.json()['job_uuid']
        print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")
Contributor

and here

        print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")
        return uuid
    except Exception as e:
        print(e)
Contributor

and here

def get_latest_job_status_from_hq(self, refresh_job_statuses, job_uuids):
    import requests

    print(f"Getting lastest job status for uuid {job_uuids}")
Contributor

...

while checkagain and (run_time < self.TIMEOUT):
    result = self.get_latest_job_status_from_hq(url_status, uuid)[0]
    checkagain = result['status'] == "SCHEDULED" or result['status'] == "TRAINING"
    print(f"Training Status:{result['status']} at {datetime.now().strftime('%Y-%m-%d, %H:%M:%S')}")
Contributor

...

raise Exception(f"Failed to train the submitted job\nError : {result['status']}")

run_time = time.time()-start_time
print(f"Time spent: {run_time} s")
Contributor

...
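The polling loop in the hunks above could be written with logging and an explicit timeout along these lines. This is a sketch, not the PR's implementation: `wait_for_job` and `fetch_status` are hypothetical names, the sleep between polls is an assumption, and "COMPLETED" is a made-up terminal status used only for the demo. Only "SCHEDULED" and "TRAINING" come from the diff.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Status names taken from the diff; any other status is treated as terminal.
PENDING_STATUSES = {"SCHEDULED", "TRAINING"}


def wait_for_job(fetch_status, timeout_s, poll_interval_s=1.0):
    """Poll fetch_status() until the job leaves a pending state.

    fetch_status is a hypothetical zero-argument callable returning a status string.
    Returns the first non-pending status; raises TimeoutError if the job is
    still pending once timeout_s seconds have elapsed.
    """
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    while True:
        status = fetch_status()
        elapsed = time.monotonic() - start
        logger.info("Training status: %s after %.1f s", status, elapsed)
        if status not in PENDING_STATUSES:
            return status
        if elapsed >= timeout_s:
            raise TimeoutError(f"Job still {status} after {elapsed:.1f} s")
        time.sleep(poll_interval_s)


# Demo with a fake status source; "COMPLETED" is a hypothetical status name.
fake_statuses = iter(["SCHEDULED", "TRAINING", "COMPLETED"])
final_status = wait_for_job(lambda: next(fake_statuses), timeout_s=5, poll_interval_s=0.01)
print(final_status)
```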

@@ -33,6 +33,9 @@ dependencies:
- psutil
- kivy=2.0.0
- plotly
- requests
- requests_toolbelt
Contributor

It seems we can't find this package in the conda main and conda-forge channels. Maybe it's only available via pip?

@@ -35,6 +35,9 @@ dependencies:
- kivy=2.0.0
- plotly
- tensorflow==2.2.0
- requests
- requests_toolbelt
Contributor

This seems to produce the same problem as in the Mac conda yaml file.

@@ -34,6 +34,9 @@ dependencies:
- kivy=2.0.0
- plotly
- psutil
- requests
- requests_toolbelt
Contributor

No idea if this works on Windows... do you have any means to check that?
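One low-effort way to check on any platform is to create the environment there and then run a small script that reports whether the newly added packages can be found. A minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util
import platform

# Packages added to the environment files in this diff.
REQUIRED = ["requests", "requests_toolbelt"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print(f"{platform.system()}: missing packages: {missing or 'none'}")
```

This only confirms that the packages are installed and discoverable; it does not exercise the upload itself.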
