Labels: gha infra (Related to our self hosted Github Actions infrastructure)
Description
Seeing the following errors in pytorch/audio and pytorch/vision:
https://github.com/pytorch/vision/actions/runs/5536861083/jobs/10105070853
https://github.com/pytorch/audio/actions/runs/5536859853/jobs/10105065270
This happens during the "Initialize containers" step:
Digest: sha256:98496e83272013c2c5a0d28a2759ad952372210559879f436bf57b529bd95b0f
Status: Downloaded newer image for pytorch/manylinux-builder:cuda11.8
docker.io/pytorch/manylinux-builder:cuda11.8
/usr/bin/docker create --name 71f4513435284b16a1173e5b4f1e4775_pytorchmanylinuxbuildercuda118_9719a7 --label dc4f4d --workdir /__w/vision/vision --network github_network_bd81af612f9f40328280bfd2cbb98a6a --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ec2-user/actions-runner/_work":"/__w" -v "/home/ec2-user/actions-runner/externals":"/__e":ro -v "/home/ec2-user/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ec2-user/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ec2-user/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ec2-user/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ec2-user/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" pytorch/manylinux-builder:cuda11.8 "-f" "/dev/null"
6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
/usr/bin/docker start 6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Error: failed to start containers: 6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
Error: Docker start fail with exit code 1
This error happens only with CUDA builds (the only jobs that request --gpus all, per the docker create call above).
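
The failing flag is the --gpus all in the docker create call: docker start fails because the Docker daemon on the runner host has no device driver registered for the "gpu" capability, which typically means the NVIDIA Container Toolkit is missing or not hooked into Docker. A minimal check on the runner host (a sketch, assuming SSH access to the EC2 instance; the CUDA image tag is illustrative):

docker info --format '{{json .Runtimes}}'                                  # "nvidia" should appear among the runtimes
nvidia-smi                                                                 # confirms the host driver itself is loaded
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi  # minimal repro of the failing flag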
It started happening in audio on Fri, Jul 7 at 12:16 pm:
https://github.com/pytorch/audio/actions/runs/5488475799/jobs/10001389130
This workflow shows that the job may not always fail on rerun: https://github.com/pytorch/vision/actions/runs/5519379435/jobs/10065684909
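
If the toolkit turns out to be installed but no longer registered with the daemon (for example after an AMI or Docker upgrade), re-registering it is one plausible fix. A sketch, assuming the nvidia-ctk CLI from the NVIDIA Container Toolkit is available on the host:

sudo nvidia-ctk runtime configure --runtime=docker   # writes the nvidia runtime into /etc/docker/daemon.json
sudo systemctl restart docker                        # restart the daemon so --gpus can resolve a driver

That would also be consistent with the flaky reruns: if only some instances in the runner pool are misconfigured, a rerun that lands on a healthy instance passes.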