CI timeout (test-llava-runner-linux) since #7922 (#8180)

@swolchok

🐛 Describe the bug

Switching from the nightly wheel to --use-pt-pinned-commit (a from-source build of the pinned PyTorch commit, which matches nightly) caused CI timeouts for long-running jobs, apparently accompanied by large timestamp "gaps" in the logs.

In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes, see 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:

  • A 14-minute "gap" in the logs, jumping from 2025-01-31T23:11:50.6702881Z to 2025-01-31T23:25:21.0383351Z, during export.
  • A 25-minute "gap" in the logs, from 2025-01-31T23:25:21.2243143Z to 2025-01-31T23:42:40.6914293Z; the second message is just the job timeout. This also seems to be during export. Offhand, I'm not sure why we export multiple times, but that is a separate problem regardless.
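
For anyone who wants to look for similar gaps in other raw logs, here is a minimal sketch of how they could be located automatically. It assumes each log line starts with a UTC timestamp in the format quoted above (seven fractional digits, trailing Z). The helper names parse_ts and find_gaps, the 5-minute threshold, and the placeholder message text are all hypothetical and not part of any CI tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches a leading timestamp such as 2025-01-31T23:11:50.6702881Z.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z")


def parse_ts(line):
    """Return the leading timestamp of a log line as an aware datetime, or None."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base, frac = m.groups()
    # strptime's %f accepts at most 6 digits, so truncate (or pad) the fraction.
    micro = frac[:6].ljust(6, "0")
    parsed = datetime.strptime(f"{base}.{micro}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (start, end, delta) for consecutive timestamps at least `threshold` apart."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # lines without a timestamp are skipped
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# The four timestamps quoted in this issue; the message text is a placeholder.
sample = [
    "2025-01-31T23:11:50.6702881Z <message elided>",
    "2025-01-31T23:25:21.0383351Z <message elided>",
    "2025-01-31T23:25:21.2243143Z <message elided>",
    "2025-01-31T23:42:40.6914293Z <message elided>",
]

for start, end, delta in find_gaps(sample):
    print(f"{start.isoformat()} -> {end.isoformat()}: {delta.total_seconds() / 60:.1f} min")
```

On the four timestamps quoted above, this reports two gaps of about 13.5 and 17.3 minutes. The second is shorter than the 25 minutes stated, so one of the quoted values may have been transcribed imperfectly.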

@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.

I've ruled out safetensors download being the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.

Versions

N/A

Labels

module: ci (Issues related to continuous integration)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)