Introduce tony.application.x.untracked.timeout to solve partial jobs hang #610

zuston · 2021-10-13T08:03:53Z

No description provided.

…unfinished

oliverhu · 2021-10-13T20:45:24Z

Why not just tony.x.timeout? Don't tracked tasks also need a timeout?

zuston · 2021-10-14T02:16:20Z

Refer to #603.

Why
On TFRuntime, I found two problems with TensorFlow training in our production environment.

1. Sometimes, when using the tf.estimator API, the evaluator may keep waiting for the newest global step even though the chief has already finished, so the evaluator hangs. TF doc
2. Sometimes a worker hangs due to TF bugs even though the chief has finished its task, so we need to kill the workers that have not finished.

Besides, do we need to fail the job when some worker or evaluator has not finished?

I think that can depend on the user's config.

I think introducing an untracked task timeout is enough to solve the problems above.
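
The untracked task timeout described above can be sketched roughly as follows. This is only an illustration in Python, not TonY's actual implementation (TonY itself is written in Java). The `Task` record, the `tasks_to_kill` function, the task names, and the use of seconds as the unit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    # Hypothetical task record; TonY's real task model differs.
    name: str
    tracked: bool   # assumption: the chief is tracked, hanging tasks are untracked
    finished: bool


def tasks_to_kill(tasks: List[Task],
                  tracked_done_at: Optional[float],
                  now: float,
                  untracked_timeout_s: float) -> List[str]:
    """Return the names of untracked tasks that exceeded the timeout.

    tracked_done_at is the time (in seconds) at which the last tracked
    task finished, or None while a tracked task is still running.
    """
    if tracked_done_at is None:
        return []  # tracked tasks are still running, so the timer has not started
    if now - tracked_done_at < untracked_timeout_s:
        return []  # still inside the grace period
    return [t.name for t in tasks if not t.tracked and not t.finished]
```

Whether the job is then marked failed or succeeded after these tasks are killed is left open here, matching the suggestion that it could depend on user config.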

Do you have other ideas about possible problems?

zuston · 2021-10-15T02:09:33Z

Gentle ping @oliverhu

oliverhu · 2021-10-15T07:27:50Z

I think the tony.x.timeout notion is more generic and easier to comprehend. People can specify tony.evaluator.timeout=100s etc.
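
A per-task-type lookup of this kind could look roughly like the sketch below. It is written in Python purely for illustration. The helper names `parse_duration` and `timeout_for` are hypothetical, and accepting only `s`, `m`, and `h` suffixes is an assumption based on the `100s` example, not a statement about what TonY accepts.

```python
import re
from typing import Dict, Optional

# Assumed unit suffixes; the thread only shows values such as "100s".
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value: str) -> int:
    """Convert a string such as '100s' or '10h' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh])\s*", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_for(conf: Dict[str, str], task_type: str) -> Optional[int]:
    """Look up tony.<task_type>.timeout; None means no timeout is configured."""
    raw = conf.get(f"tony.{task_type}.timeout")
    return None if raw is None else parse_duration(raw)
```

With `conf = {"tony.evaluator.timeout": "100s"}`, `timeout_for(conf, "evaluator")` gives 100 and `timeout_for(conf, "worker")` gives `None`.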

oliverhu · 2021-10-25T20:48:19Z

We discussed offline having some generic task grouping and specifying dependencies between groups, e.g.
tony.groupA.timeout.aftergroupB = 10h
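
As a rough illustration of how keys of that shape might be read, the Python sketch below extracts the waiting group and the group it depends on. The `DependencyTimeout` type, the `parse_dependency_timeouts` function, and the exact key pattern are assumptions made for this example; they are not taken from TonY or from #621.

```python
import re
from typing import Dict, List, NamedTuple


class DependencyTimeout(NamedTuple):
    # Hypothetical rule: `group` may run at most `raw_timeout`
    # after `after_group` has finished.
    group: str
    after_group: str
    raw_timeout: str


# Assumed key shape: tony.<group>.timeout.after<dependency group>
_KEY = re.compile(r"^tony\.(?P<group>[^.]+)\.timeout\.after(?P<dep>[^.]+)$")


def parse_dependency_timeouts(conf: Dict[str, str]) -> List[DependencyTimeout]:
    """Collect every dependency timeout rule found in the config."""
    rules = []
    for key, value in conf.items():
        match = _KEY.match(key)
        if match:
            rules.append(DependencyTimeout(match.group("group"),
                                           match.group("dep"),
                                           value.strip()))
    return rules
```

Keys without the `after...` suffix, such as `tony.evaluator.timeout`, are ignored by this pattern.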

zuston · 2021-11-25T09:13:57Z

Closing this. Solved by #621.

zuston marked this pull request as draft October 13, 2021 08:03

zuston force-pushed the hb branch from 4b8c2bb to 1041dbf Compare October 13, 2021 08:04

Introduce tony.application.x.untracked.timeout to solve partial jobs …

a8550c3

…unfinished

zuston force-pushed the hb branch from 1041dbf to a8550c3 Compare October 13, 2021 09:12

zuston requested a review from oliverhu October 13, 2021 10:09

Merge branch 'master' into hb

b8449e9

zuston mentioned this pull request Nov 25, 2021

Make job fail when partial tasks' pre-dependent tasks finished and exceeds the waiting timeout #621

Merged

zuston closed this Nov 25, 2021
