Flaky manywheel builds: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] #4385

@atalman

The following errors occur in the audio and vision builds:
https://github.com/pytorch/vision/actions/runs/5536861083/jobs/10105070853
https://github.com/pytorch/audio/actions/runs/5536859853/jobs/10105065270

The failure happens during the "Initialize containers" step:

  Digest: sha256:98496e83272013c2c5a0d28a2759ad952372210559879f436bf57b529bd95b0f
  Status: Downloaded newer image for pytorch/manylinux-builder:cuda11.8
  docker.io/pytorch/manylinux-builder:cuda11.8
  /usr/bin/docker create --name 71f4513435284b16a1173e5b4f1e4775_pytorchmanylinuxbuildercuda118_9719a7 --label dc4f4d --workdir /__w/vision/vision --network github_network_bd81af612f9f40328280bfd2cbb98a6a --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ec2-user/actions-runner/_work":"/__w" -v "/home/ec2-user/actions-runner/externals":"/__e":ro -v "/home/ec2-user/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ec2-user/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ec2-user/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ec2-user/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ec2-user/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" pytorch/manylinux-builder:cuda11.8 "-f" "/dev/null"
  6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
  /usr/bin/docker start 6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
  Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
  Error: failed to start containers: 6887acb3f5002bf674b558cae0b7e3dc00e57d85aac830fd4df03aee89e54adc
  Error: Docker start fail with exit code 1

This error happens only with CUDA builds.
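
To track how often this failure occurs across runs, the error signature from the log above can be matched mechanically. The following is a minimal sketch, assuming the job log is available as plain text. The names `GPU_DRIVER_ERROR` and `is_gpu_driver_error` are hypothetical and not part of any existing tooling.

```python
import re

# Signature copied from the failing log. The driver name between the quotes
# is empty in the reported failures; the pattern accepts any name there.
GPU_DRIVER_ERROR = re.compile(
    r'could not select device driver "[^"]*" with capabilities: \[\[gpu\]\]'
)


def is_gpu_driver_error(log_text: str) -> bool:
    """Return True if the log text contains the Docker GPU driver error."""
    return GPU_DRIVER_ERROR.search(log_text) is not None


# Two lines taken verbatim from the log in this issue.
failing_line = (
    'Error response from daemon: could not select device driver "" '
    "with capabilities: [[gpu]]"
)
unrelated_line = "Status: Downloaded newer image for pytorch/manylinux-builder:cuda11.8"

print(is_gpu_driver_error(failing_line))
print(is_gpu_driver_error(unrelated_line))
```

A check like this only classifies a failure after the fact. It does not explain why the daemon could not select a GPU driver on the affected runner.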

It started happening in audio on Fri Jul 7, 12:16 pm:
https://github.com/pytorch/audio/actions/runs/5488475799/jobs/10001389130

This workflow shows that the job does not always fail on rerun: https://github.com/pytorch/vision/actions/runs/5519379435/jobs/10065684909

Labels: gha infra (related to our self-hosted GitHub Actions infrastructure)
