Add "trainremote" command #906
@BillyCheung10botics - this is a great addition. I have three comments:
@BillyCheung10botics - can you have a look at addressing the installation issues from the checks above, and a couple of minor comments? It would be great to have that change in Donkey Car.
'donkeycar_version': str(__version__)
}
)
print(f"URL submitted to: {submit_job_url}")
Please switch to logging statements instead of print, that's what we are using now.
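A minimal sketch of what this request could look like, assuming a module-level logger obtained with `logging.getLogger(__name__)`. The helper name `report_submission` and the example URL are hypothetical and not part of the PR:

```python
import logging

# Module-level logger named after the current module (assumed convention).
logger = logging.getLogger(__name__)


def report_submission(submit_job_url: str, status_code: int) -> None:
    """Hypothetical helper mirroring the two print calls flagged in review."""
    # %-style arguments are only formatted if the record is actually emitted.
    logger.info("URL submitted to: %s", submit_job_url)
    logger.info("Submission response: HTTP %s", status_code)


if __name__ == "__main__":
    # Without a configured handler, INFO records are not shown.
    logging.basicConfig(level=logging.INFO)
    report_submission("https://example.invalid/submit", 200)
```

Unlike `print`, the messages can then be filtered by level or redirected by whoever configures logging for the application.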
# The MultipartEncoder provides the content-type header with the boundary:
headers={'Content-Type': mp_encoder.content_type}
)
print(f"Submission response: HTTP {r.status_code}")
here too
if ("job_uuid" in r.json()):
    try:
        uuid = r.json()['job_uuid']
        print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")
and here
    print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")
    return uuid
except Exception as e:
    print(e)
and here
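One possible shape for the uuid handling above, sketched under assumptions: the response body is decoded once (for example with `r.json()`) and passed in as a plain mapping, and failures are logged rather than printed. The function name `extract_job_uuid` is hypothetical:

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger(__name__)


def extract_job_uuid(payload: Mapping[str, Any],
                     submit_job_url: str) -> Optional[str]:
    """Hypothetical helper: return the job uuid from a decoded response, or None."""
    uuid = payload.get("job_uuid")
    if uuid is None:
        # Log the unexpected payload instead of printing the exception.
        logger.error("No job_uuid in response from %s: %s",
                     submit_job_url, payload)
        return None
    logger.info("Submitted a training job to %s, uuid: %s",
                submit_job_url, uuid)
    return uuid
```

Using `payload.get` avoids both the second JSON decode and the broad `except Exception` block.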
def get_latest_job_status_from_hq(self, refresh_job_statuses, job_uuids):
    import requests

    print(f"Getting lastest job status for uuid {job_uuids}")
...
while checkagain and (run_time < self.TIMEOUT):
    result = self.get_latest_job_status_from_hq(url_status, uuid)[0]
    checkagain = result['status'] == "SCHEDULED" or result['status'] == "TRAINING"
    print(f"Training Status:{result['status']} at {datetime.now().strftime('%Y-%m-%d, %H:%M:%S')}")
...
raise Exception(f"Failed to train the submitted job\nError : {result['status']}")

run_time = time.time()-start_time
print(f"Time spent: {run_time} s")
...
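For reference, the polling loop in this file could be expressed with logging roughly as follows. This is only a sketch: `fetch_status` is a hypothetical stand-in for the HTTP status request, and the two pending status names are the ones that appear in the loop under review.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)

# Statuses that mean the job is still running, taken from the loop under review.
PENDING_STATUSES = {"SCHEDULED", "TRAINING"}


def wait_for_job(fetch_status: Callable[[], str],
                 timeout_s: float,
                 poll_interval_s: float = 5.0) -> str:
    """Poll fetch_status until it leaves the pending states or timeout_s elapses.

    fetch_status is a hypothetical callable wrapping the status request.
    """
    start = time.monotonic()
    while True:
        status = fetch_status()
        elapsed = time.monotonic() - start
        logger.info("Training status: %s after %.1f s", status, elapsed)
        if status not in PENDING_STATUSES:
            return status
        if elapsed >= timeout_s:
            raise TimeoutError(f"Job still {status} after {elapsed:.1f} s")
        # Wait before asking the server again.
        time.sleep(poll_interval_s)
```

`time.monotonic()` is used for the elapsed time because it is not affected by system clock adjustments.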
@@ -33,6 +33,9 @@ dependencies:
- psutil
- kivy=2.0.0
- plotly
- requests
- requests_toolbelt
It seems this package can't be found in the conda main or conda-forge channels; maybe it's only available via pip?
@@ -35,6 +35,9 @@ dependencies:
- kivy=2.0.0
- plotly
- tensorflow==2.2.0
- requests
- requests_toolbelt
This seems to produce the same problem as in the Mac conda yaml file.
@@ -34,6 +34,9 @@ dependencies:
- kivy=2.0.0
- plotly
- psutil
- requests
- requests_toolbelt
No idea if this works on Windows... do you have any means to check that?
This commit adds a new command "trainremote" that allows users to train their AI model on our AWS server. Currently, the training is hardcoded to train a KerasLinear model with a maximum of 30 epochs. The command takes the mandatory arguments --tub, --path and --model, and the optional arguments --url and --get:
--tub: specify the path(s) of your data tub(s)
--path: specify the path of your mycar
--model: specify the file name (without extension) of the resulting model, loss function plot and movie
--url: the URL to post a training request to
--get: the URL to get the status of the submitted job from
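The argument list above could be declared with `argparse` roughly like this. It is a sketch under assumptions, not the PR's actual parser: the function name `build_parser` is hypothetical, and no defaults are given for `--url` and `--get` because the description does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the arguments listed above."""
    parser = argparse.ArgumentParser(prog="trainremote")
    # Mandatory arguments; --tub accepts one or more paths.
    parser.add_argument("--tub", nargs="+", required=True,
                        help="path(s) of your data tub(s)")
    parser.add_argument("--path", required=True,
                        help="path of your mycar")
    parser.add_argument("--model", required=True,
                        help="file name (without extension) of the resulting model")
    # Optional arguments; defaults left unset since the description gives none.
    parser.add_argument("--url", default=None,
                        help="URL to post a training request to")
    parser.add_argument("--get", default=None,
                        help="URL to get the status of the submitted job from")
    return parser


if __name__ == "__main__":
    # Example paths and model name below are placeholders.
    args = build_parser().parse_args(
        ["--tub", "data/tub1", "--path", "mycar", "--model", "pilot"])
    print(args)
```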
After a job is submitted via "trainremote", the training takes place on our AWS server with a maximum duration of 15 min. You can view your submitted job (and others' jobs) at https://hq.robocarstore.com/train/. Training usually takes around 10 min to complete. A model file, a plot and a movie are then downloaded from our server and saved to the 'models' and 'movies' folders.