Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Azure default image #2468

Merged
merged 16 commits into from
Nov 16, 2023
Merged

Update Azure default image #2468

merged 16 commits into from
Nov 16, 2023

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Aug 27, 2023

This PR updates the default Azure Image to ubuntu-hpc 20230803, an image with the latest NVIDIA driver(535.54.03) and CUDA version (12.2).

Reference: https://github.com/Azure/azhpc-images/releases/tag/ubuntu-hpc-20230803

For context, seems like all ubuntu-hpc image is V2 so I still use the original image for V1 instances.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Member

@infwinston infwinston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @cblmemo! as this PR upgrades CUDA version to 12. we want to make sure it works for common GPUs. Can we test T4?

sky/clouds/azure.py Outdated Show resolved Hide resolved
sky/clouds/azure.py Outdated Show resolved Hide resolved
Comment on lines 163 to 164
# ubuntu-2004 v21.11.04, one image we used to use for V1 HyperV
# before we change default image to ubuntu-hpc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gen_version is reference to HyperV generation IIUC?

elif item['name'] == 'HyperVGenerations':

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not read. Could you rephrase the comment? Aslo, does the new ubuntu-hpc work for gen V1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it only supports V2. To double-check, you can use:

import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

images_output = subprocess.check_output(
    "az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all", shell=True
).decode("utf-8")

images = json.loads(images_output)
print(f'Found {len(images)} images')


def check_image(image):
    output = (
        subprocess.check_output(
            (
                "az vm image show "
                f'--publisher {image["publisher"]} '
                f'--offer {image["offer"]} '
                f'--sku {image["sku"]} '
                f'--version {image["version"]} '
                '--query "hyperVGeneration" -o tsv'
            ),
            shell=True,
        )
        .decode("utf-8")
        .strip()
    )
    print(output)


with ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(check_image, images))

cblmemo and others added 2 commits August 26, 2023 23:31
Co-authored-by: Wei-Lin Chiang <infwinston@gmail.com>
Co-authored-by: Wei-Lin Chiang <infwinston@gmail.com>
@cblmemo
Copy link
Collaborator Author

cblmemo commented Aug 27, 2023

thanks @cblmemo! as this PR upgrades CUDA version to 12. we want to make sure it works for common GPUs. Can we test T4?

I just realized there are not a lot of tests for Azure in our default smoke test... Will run with --azure flag again

@infwinston
Copy link
Member

We could just test it using the TGI example. with T4 + 3b model? if this works we can merge it.

@romilbhardwaj romilbhardwaj modified the milestone: 0.4 Sep 11, 2023
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this @cblmemo! Left some comments about the cloud init.

Comment on lines 271 to 272
# Cloud init script is encoded in base64 in azure, so we need to decode it
# and replace the ssh user and public key content, then encode it back.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to encode the script with base64?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... Azure only accepts base64 encoded in the customData field. See #1910 (comment) for more information

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add a comment for this

@@ -249,13 +253,15 @@ def make_deploy_resources_variables(
# This script will modify /etc/ssh/sshd_config and add a bash script
# into .bashrc. The bash script will restart sshd if it has not been
# restarted, identified by a file /tmp/__restarted is existing.
# Also, add default user to docker group.
# pylint: disable=line-too-long
cloud_init_setup_commands = base64.b64encode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, do we really need to encode the cloud init to base64? Is it possible to use a plain text with some careful yaml syntax, e.g. using : | or : |- to avoid unexpected new line and indents?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @cblmemo! Left several comments.

sky/authentication.py Outdated Show resolved Hide resolved
Comment on lines 163 to 164
# ubuntu-2004 v21.11.04, one image we used to use for V1 HyperV
# before we change default image to ubuntu-hpc.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not read. Could you rephrase the comment? Aslo, does the new ubuntu-hpc work for gen V1?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the image for Azure instances @cblmemo! cc'ing @romilbhardwaj to have a test on this PR for #2751.

sky/clouds/azure.py Show resolved Hide resolved
sky/clouds/azure.py Show resolved Hide resolved
@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 3, 2023

Fixed - Please let me know when you finish testing @romilbhardwaj and I'll merge this 🫡

@Michaelvll
Copy link
Collaborator

The latest PR seems working with the following nvidia-smi output:

Thu Nov 16 23:26:09 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0              41W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |

Please feel free to merge it @cblmemo. : )

@cblmemo cblmemo merged commit 4464aee into master Nov 16, 2023
18 checks passed
@cblmemo cblmemo deleted the azure-new-image branch November 16, 2023 23:59
@datainvestor
Copy link

datainvestor commented Nov 30, 2023

@cblmemo Is this only available on nightly? Because I keep getting error than the cuda driver is less than 11.8 when running Azure VM with A100 GPU

@concretevitamin
Copy link
Member

@datainvestor Yes, this should be available in nightly: pip install -U skypilot-nightly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants