Add "trainremote" command #906

BillyCheung10botics · 2021-07-21T17:15:26Z

This commit adds a new command, "trainremote", that lets users train their AI model on our AWS server. Currently, the training is hardcoded to a KerasLinear model with a maximum of 30 epochs. Example usage:

donkey trainremote --tub ~/mycar/data/tub_4_21-06-11/ ~/mycar/data/tub_3_21-06-11/ --path ~/mycar --model modelname 
--url https://hq.robocarstore.com/train/submit_job --get https://hq.robocarstore.com/train/refresh_job_statuses

The arguments --tub, --path and --model are mandatory; --url and --get are optional:

--tub: specify the path(s) of your data tub(s)
--path: specify the path of your car folder (e.g. ~/mycar)
--model: specify the file name (without extension) of the resulting model, loss function plot and movie
--url: the url to post a training request
--get: the url to get the status of the submitted job
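
For illustration, the argument list above could be declared with argparse roughly as follows. This is only a sketch, not the code in this PR: the `build_parser` helper is hypothetical, and using the URLs from the example command as defaults is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options described above."""
    parser = argparse.ArgumentParser(prog="trainremote")
    # nargs="+" accepts one or more tub paths after --tub.
    parser.add_argument("--tub", nargs="+", required=True,
                        help="path(s) of the data tub(s)")
    parser.add_argument("--path", required=True,
                        help="path of the car folder")
    parser.add_argument("--model", required=True,
                        help="output file name, without extension")
    # Assumption: the URLs from the example command serve as defaults.
    parser.add_argument("--url",
                        default="https://hq.robocarstore.com/train/submit_job",
                        help="URL to post the training request to")
    parser.add_argument("--get",
                        default="https://hq.robocarstore.com/train/refresh_job_statuses",
                        help="URL to query the job status from")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--tub", "tub_a", "tub_b", "--path", "mycar", "--model", "modelname"]
)
```

With `nargs="+"`, several tubs can follow a single `--tub` flag, which matches the example command.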

After a job is submitted via "trainremote", the training takes place on our AWS server with a maximum duration of 15 min. You can view your submitted job (and others' jobs) at https://hq.robocarstore.com/train/. Training usually takes around 10 min to complete. A model file, a plot and a movie are then downloaded from our server and saved to the folders 'models' and 'movies' respectively.
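
The wait-until-finished behaviour described above can be sketched as a polling loop with a timeout. This is a simplified illustration, not the PR's implementation: `wait_for_job`, its parameters and the 10 s polling interval are assumptions, and the final status name "SUCCESS" in the usage lines is hypothetical. Only the "SCHEDULED" and "TRAINING" states and the 15 min limit come from this PR.

```python
import time
from typing import Callable

TIMEOUT = 15 * 60  # seconds; the 15 min limit mentioned above


def wait_for_job(fetch_status: Callable[[], str],
                 timeout: float = TIMEOUT,
                 interval: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the job leaves SCHEDULED/TRAINING or the timeout expires."""
    start = clock()
    while True:
        status = fetch_status()
        if status not in ("SCHEDULED", "TRAINING"):
            return status  # the job reached some final state
        if clock() - start >= timeout:
            raise TimeoutError(f"job still {status} after {timeout} s")
        sleep(interval)


# Usage with a fake status source; "SUCCESS" is a hypothetical final state.
statuses = iter(["SCHEDULED", "TRAINING", "SUCCESS"])
final = wait_for_job(lambda: next(statuses), sleep=lambda s: None)
```

Passing `sleep` and `clock` in as parameters keeps the loop testable without real waiting; in the command itself, `fetch_status` would wrap the HTTP request to the status URL.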

DocGarbanzo · 2021-08-02T12:18:24Z

@BillyCheung10botics - this is a great addition. I have three comments:

Can you make --path optional, as we have in the train and other commands?
Can you please convert print statements to logging?
The checks are failing. Maybe the package is not available in our conda channels; in that case it should go under pip. Can you check that please?

DocGarbanzo · 2021-09-03T19:45:13Z

@BillyCheung10botics - can you have a look at addressing the installation issues from the checks above, and a couple of minor comments? It would be great to have that change in Donkey Car.

DocGarbanzo · 2021-09-03T19:40:09Z

donkeycar/management/base.py

+                'donkeycar_version': str(__version__)
+            }
+        )
+        print(f"URL submitted to: {submit_job_url}")


Please switch to logging statements instead of print; that's what we are using now.
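
As a rough sketch of the requested change, print calls like the ones in this PR could be replaced with a module-level logger. The `report_submission` helper is hypothetical and only groups two of the messages for demonstration; the `basicConfig` call stands in for whatever logging setup the application already has.

```python
import logging

logger = logging.getLogger(__name__)


def report_submission(submit_job_url: str, status_code: int) -> None:
    """Hypothetical helper: log the messages that were previously printed."""
    # %-style arguments are only formatted if the record is actually emitted.
    logger.info("URL submitted to: %s", submit_job_url)
    logger.info("Submission response: HTTP %s", status_code)


# Assumption: INFO level is enabled somewhere at application start-up.
logging.basicConfig(level=logging.INFO)
report_submission("https://hq.robocarstore.com/train/submit_job", 200)
```

For the `except Exception as e: print(e)` case, `logger.exception(...)` inside the handler would additionally record the traceback.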

DocGarbanzo · 2021-09-03T19:40:21Z

donkeycar/management/base.py

+            # The MultipartEncoder provides the content-type header with the boundary:
+            headers={'Content-Type': mp_encoder.content_type}
+        )
+        print(f"Submission response: HTTP {r.status_code}")


DocGarbanzo · 2021-09-03T19:40:40Z

donkeycar/management/base.py

+            if ("job_uuid" in r.json()):
+                try:
+                    uuid = r.json()['job_uuid']
+                    print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")


DocGarbanzo · 2021-09-03T19:40:48Z

donkeycar/management/base.py

+                    print(f"Submitted a Training Job to the remote server {submit_job_url}\n uuid: {uuid}")
+                    return uuid
+                except Exception as e:
+                    print(e)


DocGarbanzo · 2021-09-03T19:41:18Z

donkeycar/management/base.py

+    def get_latest_job_status_from_hq(self, refresh_job_statuses, job_uuids):
+        import requests
+
+        print(f"Getting lastest job status for uuid {job_uuids}")


DocGarbanzo · 2021-09-03T19:41:34Z

donkeycar/management/base.py

+            while checkagain and (run_time < self.TIMEOUT):
+                result = self.get_latest_job_status_from_hq(url_status, uuid)[0]
+                checkagain = result['status'] == "SCHEDULED" or result['status'] == "TRAINING"
+                print(f"Training Status:{result['status']} at {datetime.now().strftime('%Y-%m-%d, %H:%M:%S')}")


DocGarbanzo · 2021-09-03T19:41:50Z

donkeycar/management/base.py

+                    raise Exception(f"Failed to train the submitted job\nError : {result['status']}")
+
+                run_time = time.time()-start_time
+                print(f"Time spent: {run_time} s")


DocGarbanzo · 2021-09-03T19:42:29Z

install/envs/mac.yml

@@ -33,6 +33,9 @@ dependencies:
  - psutil
  - kivy=2.0.0
  - plotly
+  - requests
+  - requests_toolbelt


It seems we can't find this package in the conda main and conda-forge channels. Maybe it's only available in pip?

DocGarbanzo · 2021-09-03T19:43:30Z

install/envs/ubuntu.yml

@@ -35,6 +35,9 @@ dependencies:
  - kivy=2.0.0
  - plotly
  - tensorflow==2.2.0
+  - requests
+  - requests_toolbelt


This seems to produce the same problem as in the Mac conda yaml file.

DocGarbanzo · 2021-09-03T19:44:03Z

install/envs/windows.yml

@@ -34,6 +34,9 @@ dependencies:
  - kivy=2.0.0
  - plotly
  - psutil
+  - requests
+  - requests_toolbelt


No idea if this works on Windows... do you have any means to check that?

add "trainremote" command

0a69313

DocGarbanzo requested changes Sep 3, 2021
