[Lambda] Add H100 GPU support #1948

Closed
Michaelvll opened this issue May 10, 2023 · 5 comments

Comments

@Michaelvll
Collaborator

Lambda Labs recently announced support for the H100 GPU, a more powerful GPU for ML training. We should add it to our Lambda Labs catalog and test whether it works.

@zetavg
Contributor

zetavg commented May 18, 2023

I’ve managed to get the new H100 GPU working with SkyPilot by using a dirty patch.
Just a note for those who want to use H100 before the official support:

1. First, edit ~/.sky/catalogs/v5/lambda/vms.csv and add this line at the end:

gpu_1x_h100_pcie,H100,1.0,26.0,200.0,2.40,us-west-3,"{'Gpus': [{'Name': 'H100', 'Manufacturer': 'NVIDIA', 'Count': 1.0, 'MemoryInfo': {'SizeInMiB': 81920}}], 'TotalGpuMemoryInMiB': 81920}",

Note: I’m not sure which regions are available, so here I only add us-west-3. You might need to add one line for every available region.
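If more regions turn out to be available, the extra lines could be generated rather than typed by hand. The following is only a sketch: it assumes the region is the seventh field (as in the row above), and `rows_for_regions` and the `us-east-1` region name are hypothetical, used purely for illustration.

```python
import csv
import io

# GPU info field copied from the catalog row above.
GPU_INFO = (
    "{'Gpus': [{'Name': 'H100', 'Manufacturer': 'NVIDIA', 'Count': 1.0, "
    "'MemoryInfo': {'SizeInMiB': 81920}}], 'TotalGpuMemoryInMiB': 81920}"
)
# The full row; the trailing comma leaves the last field empty.
H100_ROW = 'gpu_1x_h100_pcie,H100,1.0,26.0,200.0,2.40,us-west-3,"' + GPU_INFO + '",'

# Assumption: the region sits at index 6, as it does in the row above.
REGION_INDEX = 6


def rows_for_regions(row, regions):
    """Return one CSV line per region, copying every other field from `row`."""
    fields = next(csv.reader([row]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for region in regions:
        copy = list(fields)
        copy[REGION_INDEX] = region
        writer.writerow(copy)
    return out.getvalue()


if __name__ == "__main__":
    # "us-east-1" is a hypothetical region, not a confirmed H100 region.
    print(rows_for_regions(H100_ROW, ["us-west-3", "us-east-1"]), end="")
```

The printed lines can then be appended to ~/.sky/catalogs/v5/lambda/vms.csv once the real region list is known.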

2. Next, patch sky/skylet/providers/lambda_cloud/node_provider.py:

Find the file in your system (for example, ~/miniconda3/envs/sky/lib/python3.8/site-packages/sky/skylet/providers/lambda_cloud/node_provider.py), and replace the following line:

_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep -Eo "10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

With this:

_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep -Eo "(10|172)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

This is necessary because H100 instances appear to use different internal IPs, which start with 172 instead of 10 in my case. If you encounter a Failed to obtain private IP from node error during instance launch, try SSHing into the instance, run ip -4 -br addr show to see the IP, and adjust the command to match yours.
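To see why the original pattern misses such an address, the two expressions can be compared offline. This is a rough sketch using Python's `re` module rather than `grep -Eo`; the sample interface output and the address in it are made up, and `internal_ips` is a hypothetical helper.

```python
import re

# One IPv4 octet, the same alternation used in _GET_INTERNAL_IP_CMD.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
TAIL = r"\.".join([OCTET] * 3)

OLD_PATTERN = re.compile(r"10\." + TAIL)            # original: 10.x.x.x only
NEW_PATTERN = re.compile(r"(?:10|172)\." + TAIL)    # patched: 10.x.x.x or 172.x.x.x

# Hypothetical output of `ip -4 -br addr show` on an H100 instance.
SAMPLE = """\
lo               UNKNOWN        127.0.0.1/8
eno1             UP             172.26.135.24/20
"""


def internal_ips(pattern, text):
    """Return every substring of `text` that the pattern matches."""
    return pattern.findall(text)


if __name__ == "__main__":
    print("old:", internal_ips(OLD_PATTERN, SAMPLE))
    print("new:", internal_ips(NEW_PATTERN, SAMPLE))
```

Note that the patched pattern accepts any address starting with 172, not just a specific private range, so on a machine with several interfaces it may return more than one match.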

3. In the task yaml, use the following under resources:

resources:
  instance_type: gpu_1x_h100_pcie
  cloud: lambda

In theory, using accelerators: H100:1 should also work, but I haven't put it to the test.

Additional Notes

You may come across errors like:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Or:

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

If that happens, a newer version of PyTorch is probably needed:

pip uninstall torch
pip cache purge
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

To verify, run this in Python:

import torch
torch.tensor([1.0, 2.0]).cuda()
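A slightly more explicit check is to compare the device's compute capability against the architectures the installed PyTorch build was compiled for. This is a simplified sketch: `is_supported` and `check_current_device` are hypothetical helpers, and the exact-match test ignores any PTX (`compute_XX`) entries that might allow forward compatibility.

```python
def is_supported(capability, arch_list):
    """Return True if e.g. (9, 0) appears as 'sm_90' in the build's arch list."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_current_device():
    """Check the first CUDA device against the installed PyTorch build."""
    import torch  # imported here so the helper above works without PyTorch

    capability = torch.cuda.get_device_capability()
    arch_list = torch.cuda.get_arch_list()
    return is_supported(capability, arch_list)


if __name__ == "__main__":
    try:
        print("supported:", check_current_device())
    except Exception as exc:  # e.g. PyTorch missing or no CUDA device present
        print("could not check:", exc)
```

With the arch list from the error message above (sm_37 through sm_86), an H100 reporting capability (9, 0) would come back as unsupported.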

@concretevitamin
Member

@zetavg Thanks a bunch, this is awesome! We're very happy to accept a PR on adding H100 if you'd like. Let us know :)

@zetavg
Contributor

zetavg commented May 19, 2023

@concretevitamin I'll be happy to do so! I just opened the first one, #1969, to resolve the Failed to obtain private IP from node problem.

For the catalogs, I'm still looking for the full list of available regions for the H100 instance.

@concretevitamin
Member

From the launch console, looks like we can include us-west-3 for now.

@ewzeng
Collaborator

ewzeng commented Jun 2, 2023

H100 instances have been added to the Lambda catalog.

To update your local catalog, simply delete it:

rm ~/.sky/catalogs/v5/lambda/vms.csv

@ewzeng ewzeng closed this as completed Jun 2, 2023