@atalman (Contributor) commented Jul 26, 2023

Preinstall Nvidia driver during provisioning of GPU machines.
Fixes: #4385

@facebook-github-bot added the CLA Signed label Jul 26, 2023
@atalman changed the title from "Test build wheels" to "Preinstall Nvidia driver during provisioning" Jul 26, 2023
@@ -21,6 +21,12 @@ yum install -y curl jq git
USER_NAME=ec2-user
${install_config_runner}

%{ if nvidia_driver_install ~}
Contributor

In some cases, I have seen the installation fail right at the beginning because of an underlying hardware issue, so having try/catch and retry here would be nice. Then, if retrying still fails, the instance would not be added to the runner pool (I assume that is the behavior when the user-data.sh script fails).
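
For illustration, the retry idea suggested above could be sketched roughly as follows. This is a minimal sketch and not code from the PR: `retry_install` and `install_nvidia_driver` are hypothetical names, and the stand-in install function simply fails twice before succeeding so the retry path can be exercised on a machine without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry a flaky install step a bounded number of times.
# retry_install and install_nvidia_driver are illustrative names, not from the PR.

# Stand-in for the real driver install command (assumption): fails on the
# first two calls and succeeds on the third.
ATTEMPTS_SEEN=0
install_nvidia_driver() {
  ATTEMPTS_SEEN=$((ATTEMPTS_SEEN + 1))
  [ "$ATTEMPTS_SEEN" -ge 3 ]
}

# Run the command given after the first two arguments up to max_attempts
# times, sleeping delay_seconds between failed attempts.
retry_install() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

if retry_install 3 0 install_nvidia_driver; then
  RETRY_RESULT="installed"
else
  RETRY_RESULT="failed"
fi
echo "driver install: $RETRY_RESULT after $ATTEMPTS_SEEN attempt(s)"
```

If the final attempt fails, the caller decides what happens next: exit non-zero so the script fails as a whole, or carry on and leave the driver install to a later stage.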

Contributor Author

Done, added set +e ... set -e around this, so that if the installation fails we continue executing. PyTorch core will install nvidia-smi during CI in this case.
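
A rough sketch of what a set +e ... set -e guard of this kind looks like, for readers unfamiliar with the pattern. It assumes the surrounding script runs with set -e, which the reply implies but the visible diff does not show; `install_nvidia_driver` is a hypothetical stand-in that always fails so the guard is exercised.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "set +e ... set -e" guard around one step.
set -e  # assumption: the surrounding script aborts on the first failing command

# Stand-in for the real driver install (assumption); always fails here.
install_nvidia_driver() {
  return 1
}

set +e                      # temporarily tolerate a failing command
install_nvidia_driver
DRIVER_INSTALL_STATUS=$?    # capture the exit code before it is overwritten
set -e                      # restore fail-fast behaviour for later steps

if [ "$DRIVER_INSTALL_STATUS" -ne 0 ]; then
  echo "driver preinstall failed (exit $DRIVER_INSTALL_STATUS); continuing" >&2
fi

REACHED_END="yes"
echo "provisioning continues"
```

Capturing the exit status is optional, but it lets the script log that the preinstall was skipped instead of failing silently.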

@huydhn (Contributor) left a comment

A better approach IMO would be to have this as part of the AMI, but as we don't have that atm, this is the next best thing.

@osalpekar (Member) left a comment

@huydhn's comment makes sense; looks like this is the best solution we have atm?

@atalman merged commit 894c16e into pytorch:main Jul 27, 2023
Labels
CLA Signed, with-ssh
Development

Successfully merging this pull request may close these issues.

Flaky manywheel builds : Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
4 participants