Description
🐛 Describe the bug
Changing from nightly wheel to --use-pt-pinned-commit (from-source build of PyTorch pinned commit, which matches nightly) caused CI timeouts for long jobs, apparently with large timestamp "gaps" in logs
In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes, see 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:
- 14 minute "gap" in logs, jump from 2025-01-31T23:11:50.6702881Z to 2025-01-31T23:25:21.0383351Z, during export.
- 25 minute "gap" in logs from 2025-01-31T23:25:21.2243143Z to 2025-01-31T23:42:40.6914293Z , and the second message is just a job timeout. Seems to also be during export; not sure why we are exporting multiple times offhand, but that's a separate problem regardless.
@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.
I've ruled out safetensors download being the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.
Versions
N/A