make pending timeout customizable #1268
@@ -29,7 +29,7 @@ def estimator_op(name, image, command,
 evaluator=False, evaluator_cpu_limit=0, evaluator_memory_limit=0,
 env=[], data=[], sync_source=None,
 metrics=['Train-accuracy:PERCENTAGE'],
-arena_image='cheyang/arena_launcher:v0.5',
+arena_image='cheyang/arena_launcher:v0.6',
Should these images be moved to a more neutral docker registry name?
I agree with your suggestion. I'd like to move it to a neutral docker registry name. I will keep it in mind and try to find one. Thanks.
@@ -62,7 +62,7 @@ def parameter_servers_op(name, image, command, env, data, sync_source, annotatio
 tensorboard,
 worker_port, ps_port,
 metrics=['Train-accuracy:PERCENTAGE'],
-arena_image='cheyang/arena_launcher:v0.5',
+arena_image='cheyang/arena_launcher:v0.6',
See above
@@ -250,6 +250,9 @@ def main(argv=None):
 parser.add_argument('--timeout-hours', type=int,
 default=200,
 help='Time in hours to wait for the Job submitted by arena to complete')
+parser.add_argument('--pending-timeout-minutes', type=int,
+default=360,
+help='Time in hours to wait for the Job submitted by arena from pending to running')
The flag is named --pending-timeout-minutes, whereas the description says hours?
Thank you for spotting it! Fixed.
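For reference, a minimal sketch of how the two flags might look once the help text matches the flag name. The flag names and defaults come from the diff; the `build_parser` and `timeouts_from_args` helpers are hypothetical and only illustrate converting both values to `timedelta` so the units cannot be mixed up downstream. This is not the launcher's actual code.

```python
import argparse
from datetime import timedelta


def build_parser():
    # Hypothetical helper: only the two timeout flags from the diff are shown.
    parser = argparse.ArgumentParser()
    parser.add_argument('--timeout-hours', type=int, default=200,
                        help='Time in hours to wait for the Job submitted by arena to complete')
    parser.add_argument('--pending-timeout-minutes', type=int, default=360,
                        help='Time in minutes to wait for the Job submitted by arena '
                             'from pending to running')
    return parser


def timeouts_from_args(args):
    """Convert both flags to timedelta so later code never mixes hours and minutes."""
    return (timedelta(hours=args.timeout_hours),
            timedelta(minutes=args.pending_timeout_minutes))


# Example: override only the pending timeout; the overall timeout keeps its default.
args = build_parser().parse_args(['--pending-timeout-minutes', '30'])
total_timeout, pending_timeout = timeouts_from_args(args)
```

Carrying the unit in the type rather than in a bare integer is a design choice made for this sketch, not something the PR does.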
Thanks for the changes.
/assign @hongye-sun
@@ -249,7 +249,10 @@ def main(argv=None):
 parser.add_argument('--tensorboard-image', type=str, default='tensorflow/tensorflow:1.12.0')
 parser.add_argument('--timeout-hours', type=int,
 default=200,
-help='Time in hours to wait for the Job submitted by arena to complete')
+help='Time in minutes to wait for the Job submitted by arena to complete')
It should be hours, right?
Yes, because training always takes more than 1 hour.
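Both review comments in this thread concern a unit in the help text that disagrees with the unit in the flag name. One way to catch that early is a small check like the sketch below. The `TIMEOUT_FLAGS` table and the `unit_mismatches` function are hypothetical names introduced here for illustration; the flag names, defaults, and intended help strings follow the diff.

```python
UNITS = ('seconds', 'minutes', 'hours')

# (flag, default, help) triples; hypothetical table mirroring the flags in the diff.
TIMEOUT_FLAGS = [
    ('--timeout-hours', 200,
     'Time in hours to wait for the Job submitted by arena to complete'),
    ('--pending-timeout-minutes', 360,
     'Time in minutes to wait for the Job submitted by arena from pending to running'),
]


def unit_mismatches(flags):
    """Return flags whose name ends in a unit that the help text does not mention."""
    bad = []
    for flag, _default, help_text in flags:
        suffix = flag.rsplit('-', 1)[-1]  # e.g. 'hours' from '--timeout-hours'
        if suffix in UNITS and suffix not in help_text:
            bad.append(flag)
    return bad


# The corrected help strings pass; the version flagged in review would not.
clean = unit_mismatches(TIMEOUT_FLAGS)
flagged = unit_mismatches([
    ('--timeout-hours', 200,
     'Time in minutes to wait for the Job submitted by arena to complete'),
])
```

A check like this could run in a unit test, assuming the flag definitions are kept in a table like the one above.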
/approve
Please unhold the PR when you feel it's ready to merge. Thanks.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: animeshsingh, hongye-sun. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/unhold
/hold cancel
* make pending timeout customizable
* fix the description of arg