Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Allow disabling ECC for nvidia-gpu #3676

Merged
merged 8 commits into from
Jun 21, 2024
Merged

[Core] Allow disabling ECC for nvidia-gpu #3676

merged 8 commits into from
Jun 21, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jun 19, 2024

According to the following link, disabling ECC can improve the GPU performance by 30%, and a user has requested this knob. This is experimental and might be blocked by the task-level config, as the user would like to specify this per-task.

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LKjOCAW

Added config:

nvidia_gpu:
  disable_ecc: true

We only allow disabling ecc for the clouds with SkyPilot provisioner, as those with ray autoscaler does not handle the retry correctly.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud aws --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ecc.
    • sky launch --cloud gcp --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ecc.
    • sky launch --cloud kubernetes --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" with config set to turn off ecc. This will still launch the cluster though skipping the ECC disabling.
    • sky launch --cloud runpod --gpus rtxa6000 "nvidia-smi -q | grep 'ECC Mode' -A 2" RunPod has ecc turned off by default.
    • sky launch -c test-lam --down --cloud lambda --gpus A10 "nvidia-smi -q | grep 'ECC Mode' -A 2" skip the disabling due to the use of ray autoscaler
    • sky launch --cloud gcp --gpus t4 "nvidia-smi -q | grep 'ECC Mode' -A 2" without config set to turn off ecc.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

docs/source/reference/config.rst Outdated Show resolved Hide resolved
docs/source/reference/config.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll!

Comment on lines 879 to 880
if skypilot_config.get_nested(('nvidia_gpus', 'disable_ecc'), False):
initial_setup_commands.append(constants.DISABLE_GPU_ECC_COMMAND)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we append these commands only if disable_ecc is set to true AND to_provision contains GPU requests? Otherwise this will apply to CPU-only clusters too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Added an additional condidtion for checking the accelerators, though it should be fine even with that additional check for the CPU instances, as it will just skip the process without nvidia-smi available.

sky/templates/runpod-ray.yml.j2 Show resolved Hide resolved
docs/source/reference/config.rst Outdated Show resolved Hide resolved
docs/source/reference/config.rst Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll!

@Michaelvll Michaelvll merged commit 3436b8c into master Jun 21, 2024
20 checks passed
@Michaelvll Michaelvll deleted the disable-ecc branch June 21, 2024 22:27
Michaelvll added a commit that referenced this pull request Aug 23, 2024
* Disable ECC for nvidia-gpu

* Add config.rst

* format

* address

* Note for the reboot overhead

* address comments

* fix fluidstack

* Avoid disable ecc for clouds using ray autoscaler due to the lack of retry after reboot
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants