[Lambda] Add H100 GPU support #1948

Michaelvll · 2023-05-10T18:45:58Z

Lambda Labs recently announced support for the H100 GPU, which is a more powerful option for ML training. We should add it to our Lambda Labs catalog and test that it works.

zetavg · 2023-05-18T12:36:28Z

I’ve managed to get the new H100 GPU working with SkyPilot by using a dirty patch.
Here are some notes for anyone who wants to use the H100 before official support lands:

1. First, edit `~/.sky/catalogs/v5/lambda/vms.csv` and add this line at the end:

gpu_1x_h100_pcie,H100,1.0,26.0,200.0,2.40,us-west-3,"{'Gpus': [{'Name': 'H100', 'Manufacturer': 'NVIDIA', 'Count': 1.0, 'MemoryInfo': {'SizeInMiB': 81920}}], 'TotalGpuMemoryInMiB': 81920}",

Note: I'm not sure which regions are available, so I only added us-west-3 here. You might need to add one line for every available region.
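If more regions turn out to be available, the extra lines can be generated rather than typed by hand. This is only a rough sketch: the helper name `h100_catalog_rows` is made up for illustration, every field value is copied from the line above, and the region list is something you have to supply yourself (only us-west-3 is known from this thread).

```python
import csv
import io

# GPU description copied verbatim from the catalog line above.
GPU_INFO = (
    "{'Gpus': [{'Name': 'H100', 'Manufacturer': 'NVIDIA', 'Count': 1.0, "
    "'MemoryInfo': {'SizeInMiB': 81920}}], 'TotalGpuMemoryInMiB': 81920}"
)


def h100_catalog_rows(regions):
    """Return one catalog line per region (hypothetical helper, not part of SkyPilot)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for region in regions:
        # The empty last field reproduces the trailing comma of the original line.
        writer.writerow(
            ["gpu_1x_h100_pcie", "H100", "1.0", "26.0", "200.0", "2.40",
             region, GPU_INFO, ""]
        )
    return buf.getvalue()


# Prints the same line as above for the one region mentioned in this thread.
print(h100_catalog_rows(["us-west-3"]), end="")
```

The output can be appended to `~/.sky/catalogs/v5/lambda/vms.csv`. The price and specs are assumed to be the same in every region, which may not hold.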

2. Next, patch `sky/skylet/providers/lambda_cloud/node_provider.py`:

Find the file on your system (for example, `~/miniconda3/envs/sky/lib/python3.8/site-packages/sky/skylet/providers/lambda_cloud/node_provider.py`) and replace the following line:

_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep -Eo "10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

With this:

_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep -Eo "(10|172)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

This is necessary because H100 instances appear to use different internal IPs, which start with 172 instead of 10 in my case. If you encounter a `Failed to obtain private IP from node` error during instance launch, try SSHing into the instance, run `ip -4 -br addr show` to see the IP, and adjust the command to match yours.
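To check a pattern before patching the file, the two regexes can be tried against sample output in Python. This is a sketch under assumptions: `SAMPLE_H100_OUTPUT` is made-up text in the shape of `ip -4 -br addr show` output (not captured from a real instance), and `find_internal_ips` is a hypothetical helper that mimics what `grep -Eo` prints.

```python
import re

# One IPv4 octet, exactly as written in _GET_INTERNAL_IP_CMD.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Original pattern: only addresses starting with 10.
OLD_PATTERN = r"10\." + r"\.".join([OCTET] * 3)
# Patched pattern: addresses starting with 10 or 172.
NEW_PATTERN = r"(10|172)\." + r"\.".join([OCTET] * 3)

# Made-up sample in the shape of `ip -4 -br addr show` output.
SAMPLE_H100_OUTPUT = (
    "lo               UNKNOWN        127.0.0.1/8\n"
    "eth0             UP             172.26.1.5/20\n"
)


def find_internal_ips(pattern, text):
    """Return every full match of `pattern` in `text`, like `grep -Eo`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


print(find_internal_ips(OLD_PATTERN, SAMPLE_H100_OUTPUT))  # []
print(find_internal_ips(NEW_PATTERN, SAMPLE_H100_OUTPUT))  # ['172.26.1.5']
```

An empty result from the old pattern corresponds to the `Failed to obtain private IP from node` error. Paste your own `ip -4 -br addr show` output in place of the sample to see what the pattern would pick up.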

3. In the task YAML, use the following under `resources`:

resources:
  instance_type: gpu_1x_h100_pcie
  cloud: lambda

In theory, using `accelerators: H100:1` should also work, but I haven't put it to the test.

Additional Notes

You may come across errors like:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Or:

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

If that happens, a newer version of PyTorch is probably needed:

pip uninstall torch
pip cache purge
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

To verify, run this in Python; it should finish without either of the errors above:

import torch
torch.tensor([1.0, 2.0]).cuda()
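For a more explicit check, the GPU's compute capability can be compared with the architectures the installed PyTorch build was compiled for. This is a simplified sketch: `arch_supported` is a hypothetical helper that only looks for an exact `sm_XY` entry, which mirrors the warning text above but ignores PTX forward compatibility. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls; the torch part is skipped when PyTorch or a GPU is not present.

```python
def arch_supported(capability, arch_list):
    """Return True if the sm_XY name for `capability` is listed in `arch_list`."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


# Values from the warning above: an H100 PCIe is sm_90, and the old build
# lists architectures only up to sm_86.
old_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]
print(arch_supported((9, 0), old_build))  # False

try:
    import torch
except ImportError:
    torch = None  # PyTorch not installed; skip the live check.

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print(capability, arch_supported(capability, torch.cuda.get_arch_list()))
```

If the live check prints `False` on an H100 instance, the installed build most likely predates sm_90 support and the reinstall above is worth trying.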


concretevitamin · 2023-05-19T16:06:23Z

@zetavg Thanks a bunch, this is awesome! We're very happy to accept a PR on adding H100 if you'd like. Let us know :)

zetavg · 2023-05-19T18:21:16Z

@concretevitamin I'll be happy to do so! I just opened the first one, #1969, to resolve the `Failed to obtain private IP from node` problem.

For the catalogs, I'm still looking for the full list of regions where the H100 instance is available.

concretevitamin · 2023-05-23T16:20:53Z

From the launch console, looks like we can include us-west-3 for now.

ewzeng · 2023-06-02T18:21:06Z

H100 instances have been added to the Lambda catalog (link)

To update your local catalog, simply delete it:

rm ~/.sky/catalogs/v5/lambda/vms.csv

Michaelvll added the feature-request label May 10, 2023

zetavg mentioned this issue May 19, 2023

[Lambda Cloud] Update Regex of Internal IP for H100 Support #1969

Merged


ewzeng closed this as completed Jun 2, 2023
