-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Azure default image #2468
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @cblmemo! as this PR upgrades CUDA version to 12. we want to make sure it works for common GPUs. Can we test T4?
sky/clouds/azure.py
Outdated
# ubuntu-2004 v21.11.04, one image we used to use for V1 HyperV | ||
# before we change default image to ubuntu-hpc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gen_version
is reference to HyperV generation IIUC?
elif item['name'] == 'HyperVGenerations': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does not read. Could you rephrase the comment? Aslo, does the new ubuntu-hpc
work for gen V1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it only supports V2. To double-check, you can use:
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
images_output = subprocess.check_output(
"az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all", shell=True
).decode("utf-8")
images = json.loads(images_output)
print(f'Found {len(images)} images')
def check_image(image):
output = (
subprocess.check_output(
(
"az vm image show "
f'--publisher {image["publisher"]} '
f'--offer {image["offer"]} '
f'--sku {image["sku"]} '
f'--version {image["version"]} '
'--query "hyperVGeneration" -o tsv'
),
shell=True,
)
.decode("utf-8")
.strip()
)
print(output)
with ThreadPoolExecutor(max_workers=10) as executor:
list(executor.map(check_image, images))
Co-authored-by: Wei-Lin Chiang <infwinston@gmail.com>
Co-authored-by: Wei-Lin Chiang <infwinston@gmail.com>
I just realized there are not a lot of tests for Azure in our default smoke test... Will run with |
We could just test it using the TGI example. with T4 + 3b model? if this works we can merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this @cblmemo! Left some comments about the cloud init.
sky/authentication.py
Outdated
# Cloud init script is encoded in base64 in azure, so we need to decode it | ||
# and replace the ssh user and public key content, then encode it back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to encode the script with base64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... Azure only accepts base64 encoded in the customData
field. See #1910 (comment) for more information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add a comment for this
@@ -249,13 +253,15 @@ def make_deploy_resources_variables( | |||
# This script will modify /etc/ssh/sshd_config and add a bash script | |||
# into .bashrc. The bash script will restart sshd if it has not been | |||
# restarted, identified by a file /tmp/__restarted is existing. | |||
# Also, add default user to docker group. | |||
# pylint: disable=line-too-long | |||
cloud_init_setup_commands = base64.b64encode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, do we really need to encode the cloud init to base64? Is it possible to use a plain text with some careful yaml syntax, e.g. using : |
or : |-
to avoid unexpected new line and indents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @cblmemo! Left several comments.
sky/clouds/azure.py
Outdated
# ubuntu-2004 v21.11.04, one image we used to use for V1 HyperV | ||
# before we change default image to ubuntu-hpc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does not read. Could you rephrase the comment? Aslo, does the new ubuntu-hpc
work for gen V1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the image for Azure instances @cblmemo! cc'ing @romilbhardwaj to have a test on this PR for #2751.
Fixed - Please let me know when you finish testing @romilbhardwaj and I'll merge this 🫡 |
The latest PR seems working with the following
Please feel free to merge it @cblmemo. : ) |
@cblmemo Is this only available on nightly? Because I keep getting error than the cuda driver is less than 11.8 when running Azure VM with A100 GPU |
@datainvestor Yes, this should be available in nightly: |
This PR updates the default Azure Image to
ubuntu-hpc
20230803, an image with the latest NVIDIA driver(535.54.03) and CUDA version (12.2).Reference: https://github.com/Azure/azhpc-images/releases/tag/ubuntu-hpc-20230803
For context, seems like all
ubuntu-hpc
image is V2 so I still use the original image for V1 instances.Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh