v20250327-191353
[ALI] Fix concurrency issues for tryReuse on scaleUp (#6477)

As noted in [this issue](https://github.com/pytorch/test-infra/issues/6473), there is a concurrency problem between `tryReuse` on scaleUp and scaleDown. This PR addresses it by making sure `tryReuse` does not pick up 'stale' runners (those older than a certain age), while scaleDown only removes runners older than a certain time (that logic was already implemented). Note that for this PR to work properly, the TF variable `minimum_running_time_in_minutes` is expected to be increased, ideally to 45 minutes or more, I believe.
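
The age-based separation described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the `Runner` shape, the function names, and the `reuseCutoffMinutes` parameter are all hypothetical, and the sketch assumes the reuse cutoff is set no higher than `minimum_running_time_in_minutes` so that no runner is both reusable and removable at the same moment.

```typescript
// Hypothetical runner record; the real autoscaler types may differ.
interface Runner {
  id: string;
  launchTime: Date;
}

const MS_PER_MINUTE = 60_000;

// Age of a runner in minutes, relative to `now`.
function ageInMinutes(runner: Runner, now: Date): number {
  return (now.getTime() - runner.launchTime.getTime()) / MS_PER_MINUTE;
}

// scaleUp side (assumed logic): only reuse runners younger than the cutoff,
// so a 'stale' runner that scaleDown may be about to remove is never picked.
function isReusable(runner: Runner, now: Date, reuseCutoffMinutes: number): boolean {
  return ageInMinutes(runner, now) < reuseCutoffMinutes;
}

// scaleDown side (assumed logic): only remove runners that have been up
// for at least the minimum running time.
function isRemovable(runner: Runner, now: Date, minimumRunningTimeMinutes: number): boolean {
  return ageInMinutes(runner, now) >= minimumRunningTimeMinutes;
}

// Example values: a 45-minute minimum running time and a 40-minute reuse cutoff.
const exampleNow = new Date("2025-03-27T19:00:00Z");
const exampleRunner: Runner = {
  id: "i-example",
  launchTime: new Date(exampleNow.getTime() - 10 * MS_PER_MINUTE),
};
console.log(isReusable(exampleRunner, exampleNow, 40), isRemovable(exampleRunner, exampleNow, 45));
```

Under these assumptions, a runner between the two thresholds is neither reused nor removed, which gives the two operations a buffer instead of letting them race on the same instance.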