[k8s][examples] rdvz fix for k8s and Llama 4 example #5960

Merged
romilbhardwaj merged 2 commits into master from k8s-rdvz-fix
Jun 25, 2025

Conversation

romilbhardwaj
Collaborator

Closes #4140 and adds an example for training Llama 4 Maverick (400B MoE). The example needed a fix for rendezvous (rdvz) on Kubernetes, inspired by #3800.

- pod_uuid = str(uuid.uuid4())[:6]
- pod_name = f'{cluster_name_on_cloud}-{pod_uuid}'
- pod_spec_copy['metadata']['name'] = f'{pod_name}-worker'
+ pod_name = f'{cluster_name_on_cloud}-worker{i}'
Collaborator Author

This may raise some backward-compatibility concerns; needs to be verified.

Collaborator Author

Manually tested backward compatibility with sky exec, sky launch, and sky status -r. Since we filter pods by labels rather than by pod names, backward compatibility is unaffected.
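
For illustration, here is a minimal sketch of why a label-based lookup is insensitive to the pod rename. It assumes the snippet above shows the UUID-suffixed name as the old scheme and the indexed `worker{i}` name as the new one. The label key `example/cluster-name` and the helpers `make_pod` and `filter_pods` are hypothetical and are not SkyPilot's actual code.

```python
import uuid

# Hypothetical label key, used only for this illustration.
CLUSTER_LABEL = 'example/cluster-name'


def old_worker_name(cluster_name_on_cloud: str) -> str:
    """Assumed old scheme: a random suffix, so the name differs per launch."""
    pod_uuid = str(uuid.uuid4())[:6]
    return f'{cluster_name_on_cloud}-{pod_uuid}-worker'


def new_worker_name(cluster_name_on_cloud: str, i: int) -> str:
    """Assumed new scheme: an indexed, predictable name."""
    return f'{cluster_name_on_cloud}-worker{i}'


def make_pod(name: str, cluster: str) -> dict:
    """Build a pod-like dict carrying a cluster label (hypothetical shape)."""
    return {'metadata': {'name': name, 'labels': {CLUSTER_LABEL: cluster}}}


def filter_pods(pods: list, cluster: str) -> list:
    """Select pods by label only; pod names are never inspected."""
    return [
        p for p in pods
        if p['metadata']['labels'].get(CLUSTER_LABEL) == cluster
    ]


pods = [
    make_pod(old_worker_name('demo'), 'demo'),      # pod named the old way
    make_pod(new_worker_name('demo', 1), 'demo'),   # pod named the new way
    make_pod(new_worker_name('other', 1), 'other'),  # different cluster
]
matched = filter_pods(pods, 'demo')
```

Under these assumptions, pods named with either scheme are returned for the `demo` cluster, because the lookup matches on the label value and ignores the name.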

@romilbhardwaj
Collaborator Author

/smoke-test --kubernetes

@romilbhardwaj
Collaborator Author

/quicktest-core --kubernetes

@romilbhardwaj
Collaborator Author

/smoke-test --kubernetes

@romilbhardwaj
Collaborator Author

Smoke test failures unrelated to this PR:

  • test_skyserve_fast_update
  • test_cancel_launch_and_exec_async

The rest of the tests seem to pass. This should be good to go.

@SeungjinYang
Collaborator

/smoke-test --kubernetes -k test_cancel_launch_and_exec_async

@SeungjinYang
Collaborator

/smoke-test --kubernetes -k test_skyserve_fast_update

Collaborator

@aylei aylei left a comment

LGTM

@romilbhardwaj romilbhardwaj merged commit 3bf40a2 into master Jun 25, 2025
16 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s-rdvz-fix branch June 25, 2025 14:55
Development

Successfully merging this pull request may close these issues.

[Example] Make rdvz work with multi-node SkyPilot clusters
3 participants