v20250327-191353

@github-actions github-actions released this 27 Mar 19:16
· 41 commits to main since this release
38d53d7
[ALI] Fix concurrency issues for tryReuse on scaleUp (#6477)

As noted in [this issue](https://github.com/pytorch/test-infra/issues/6473),
there is a concurrency problem between `tryReuse` on scaleUp and
scaleDown.

This PR addresses the problem by making sure `tryReuse` does not pick up
'stale' runners (those older than a certain age), while scaleDown only
removes runners older than a certain time (that logic was already implemented).
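
The two age checks can be illustrated with a minimal sketch. This is not the code from the PR: the `RunnerInfo` shape, the `referenceTimeMs` field, and the function and parameter names are all hypothetical, and the timestamp that the age is measured from is an assumption. The idea shown is that if the reuse cutoff is no larger than the scaleDown minimum age, no runner can be selected by both paths at the same instant.

```typescript
// Hypothetical runner shape; field names are illustrative, not from the PR.
interface RunnerInfo {
  instanceId: string;
  // Epoch milliseconds of the timestamp the age is measured from (assumed).
  referenceTimeMs: number;
}

const MS_PER_MINUTE = 60 * 1000;

function ageInMinutes(runner: RunnerInfo, nowMs: number): number {
  return (nowMs - runner.referenceTimeMs) / MS_PER_MINUTE;
}

// scaleUp side: consider only runners younger than the reuse cutoff.
function reusableRunners(
  runners: RunnerInfo[],
  nowMs: number,
  maxReuseAgeMinutes: number,
): RunnerInfo[] {
  return runners.filter((r) => ageInMinutes(r, nowMs) < maxReuseAgeMinutes);
}

// scaleDown side: consider only runners at least the minimum age old.
function removableRunners(
  runners: RunnerInfo[],
  nowMs: number,
  minimumRunningTimeMinutes: number,
): RunnerInfo[] {
  return runners.filter(
    (r) => ageInMinutes(r, nowMs) >= minimumRunningTimeMinutes,
  );
}
```

In this sketch, a runner whose age falls between the two thresholds is neither reused nor removed, which is the buffer that keeps the two operations from touching the same runner.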

Note that for this PR to work properly, the TF variable
`minimum_running_time_in_minutes` is expected to be increased, ideally
to 45 minutes or more, I believe.