@zuston
Member

@zuston zuston commented Oct 13, 2021

No description provided.

@zuston zuston marked this pull request as draft October 13, 2021 08:03
@oliverhu
Member

Why not just tony.x.timeout? Don't tracked tasks also need a timeout?

@zuston
Member Author

zuston commented Oct 14, 2021

Refer to #603

Why
With TFRuntime, I found two problems with TensorFlow training in our production environment.

1. Sometimes, when using the TF Estimator API, the evaluator keeps waiting for the newest global step even though the chief has already finished, so the evaluator hangs. (See the TF doc.)
2. Sometimes a worker hangs due to TF bugs even though the chief has finished its task, so we need to kill the workers that have not finished.

Besides, do we need to mark the job as failed when some worker or evaluator has not finished?

I think that can be left to the user via config.

I think introducing an untracked task timeout is enough to solve the above problems.

Do you have other ideas about possible problems?

@zuston
Member Author

zuston commented Oct 15, 2021

Gentle ping @oliverhu

@oliverhu
Member

I think the tony.x.timeout notion is more generic and easier to comprehend. People can specify tony.evaluator.timeout=100s etc.
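
To make the suggestion concrete, here is a minimal sketch of how keys of the form tony.<type>.timeout could be collected from a flat config. The helper names parse_duration and task_timeouts, the dict-based config, and the s/m/h suffix table are all assumptions for illustration; this is not TonY's actual config handling.

```python
import re

# Hypothetical unit table; the thread only shows "100s" and "10h" as examples.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
# Matches keys such as "tony.evaluator.timeout" and captures the task type.
_KEY_PATTERN = re.compile(r"^tony\.([A-Za-z0-9_]+)\.timeout$")


def parse_duration(value: str) -> int:
    """Convert a string such as '100s' or '10h' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh])\s*", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def task_timeouts(conf: dict) -> dict:
    """Collect {task_type: timeout_seconds} from tony.<type>.timeout keys."""
    timeouts = {}
    for key, value in conf.items():
        match = _KEY_PATTERN.match(key)
        if match:
            timeouts[match.group(1)] = parse_duration(value)
    return timeouts
```

With this shape, a tracked type such as worker and an untracked type such as evaluator would be configured the same way, which is the appeal of the generic key.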

@oliverhu
Member

We discussed offline having generic task grouping and specifying dependencies between groups, e.g.
tony.groupA.timeout.aftergroupB = 10h
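
One way to read the group-dependency idea is as a rule: once every task in group B has finished, group A gets a fixed amount of time before it is stopped. The sketch below encodes only that decision. The GroupDependency class, the should_stop function, and the bookkeeping dict and set are hypothetical names, and the semantics are an assumption based on this thread, not the behavior implemented in #621.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupDependency:
    """Hypothetical model of 'tony.groupA.timeout.aftergroupB = 10h'."""
    group: str            # group whose tasks may be stopped, e.g. "groupA"
    after_group: str      # group whose completion starts the clock, e.g. "groupB"
    timeout_seconds: int  # allowed extra running time after after_group finishes


def should_stop(dep: GroupDependency,
                finished_at: dict,
                running_groups: set,
                now: float) -> bool:
    """Return True once dep.group has outlived dep.after_group by the timeout.

    finished_at maps a group name to the time (in seconds) at which all of
    its tasks finished; a missing entry means the group is still running.
    """
    if dep.group not in running_groups:
        return False  # nothing left to stop
    started: Optional[float] = finished_at.get(dep.after_group)
    if started is None:
        return False  # the clock has not started yet
    return now - started >= dep.timeout_seconds
```

For the hanging-evaluator case described earlier, the dependency would be GroupDependency("evaluator", "chief", 36000), checked periodically with the current time.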

@zuston
Member Author

zuston commented Nov 25, 2021

Closing this. Solved by #621.

@zuston zuston closed this Nov 25, 2021