[Core] Allow disabling ECC for nvidia-gpu #3676
Thanks @Michaelvll!
sky/backends/backend_utils.py
Outdated
if skypilot_config.get_nested(('nvidia_gpus', 'disable_ecc'), False):
    initial_setup_commands.append(constants.DISABLE_GPU_ECC_COMMAND)
Should we append these commands only if disable_ecc is set to true AND to_provision contains GPU requests? Otherwise this will apply to CPU-only clusters too.
Good point! Added an additional condition that checks the accelerators, though it should be fine for CPU instances even without that check, as the command simply skips the process when nvidia-smi is not available.
Thanks @Michaelvll!
* Disable ECC for nvidia-gpu
* Add config.rst
* format
* address
* Note for the reboot overhead
* address comments
* fix fluidstack
* Avoid disable ecc for clouds using ray autoscaler due to the lack of retry after reboot
According to the following link, disabling ECC can improve GPU performance by 30%, and a user has requested this knob. This is experimental and might be blocked by the task-level config, as the user would like to specify this per task.
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LKjOCAW
Added config:
We only allow disabling ECC for clouds that use the SkyPilot provisioner, as those that use the ray autoscaler do not handle the retry after reboot correctly.
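A minimal sketch of how the three conditions discussed in this PR could combine (config flag set, GPUs requested, cloud not on the ray autoscaler). This is not the actual code in sky/backends/backend_utils.py: `get_nested` here is a stand-in for `skypilot_config.get_nested`, and `RAY_AUTOSCALER_CLOUDS`, `build_initial_setup_commands`, and the placeholder command string are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

# Placeholder only; the real command is constants.DISABLE_GPU_ECC_COMMAND.
DISABLE_GPU_ECC_COMMAND = 'echo "placeholder for the ECC-disabling command"'

# Hypothetical set: clouds assumed to still use the ray autoscaler.
# Lambda is listed because the test notes say ECC disabling is skipped there.
RAY_AUTOSCALER_CLOUDS = {'lambda'}


def get_nested(config: Dict[str, Any], keys: Tuple[str, ...],
               default: Any) -> Any:
    """Stand-in for skypilot_config.get_nested: walk nested dict keys."""
    current: Any = config
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def build_initial_setup_commands(
        config: Dict[str, Any], cloud: str,
        accelerators: Optional[Dict[str, int]]) -> List[str]:
    """Return setup commands, adding the ECC command only when all checks pass."""
    commands: List[str] = []
    disable_ecc = get_nested(config, ('nvidia_gpus', 'disable_ecc'), False)
    has_gpus = bool(accelerators)  # None or {} means a CPU-only request
    uses_sky_provisioner = cloud not in RAY_AUTOSCALER_CLOUDS
    if disable_ecc and has_gpus and uses_sky_provisioner:
        commands.append(DISABLE_GPU_ECC_COMMAND)
    return commands


config = {'nvidia_gpus': {'disable_ecc': True}}
gpu_on_aws = build_initial_setup_commands(config, 'aws', {'T4': 1})
cpu_on_aws = build_initial_setup_commands(config, 'aws', None)
gpu_on_lambda = build_initial_setup_commands(config, 'lambda', {'A10': 1})
```

Under these assumptions only `gpu_on_aws` contains the command; the CPU-only request and the Lambda request both come back empty.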
Tested (run the relevant ones):
bash format.sh
sky launch --cloud aws --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ECC.
sky launch --cloud gcp --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ECC.
sky launch --cloud kubernetes --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ECC. This will still launch the cluster, though skipping the ECC disabling.
sky launch --cloud runpod --gpus rtxa6000 "nvidia-smi -q | grep 'ECC Mode' -A 2" RunPod has ECC turned off by default.
sky launch -c test-lam --down --cloud lambda --gpus A10 "nvidia-smi -q | grep 'ECC Mode' -A 2" skips the disabling due to the use of ray autoscaler.
sky launch --cloud gcp --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" without config set to turn off ECC.
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh