feat: Trainium blueprint Part1 #251

Merged
merged 10 commits into from
Jul 17, 2023

Conversation

vara-bonthu
Collaborator

@vara-bonthu vara-bonthu commented Jul 15, 2023

What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Part 1

  • Introduces a Trainium blueprint with a trn1.32xlarge node group that includes 8 network interfaces.
  • Adds Karpenter configuration for Trainium.
  • Includes two new local Helm charts for the Neuron plugin and the EFA plugin.
  • Updates the Airflow blueprint and fixes a few issues.
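
To illustrate how a workload might consume what the Neuron and EFA plugins expose, here is a minimal sketch that builds a Pod manifest requesting both device types. It is not taken from this PR: the resource names `aws.amazon.com/neuron` and `vpc.amazonaws.com/efa` are assumed to be the ones advertised by the two plugins, the default of 16 Neuron devices is an assumption about trn1.32xlarge, and the pod name and image are hypothetical placeholders.

```python
import json

# Assumed extended resource names advertised by the Neuron and EFA device plugins.
NEURON_RESOURCE = "aws.amazon.com/neuron"
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def build_trainium_pod(name, image, neuron_devices=16, efa_interfaces=8):
    """Return a Pod manifest (as a dict) requesting Neuron devices and EFA interfaces.

    The default counts are assumptions for a full trn1.32xlarge node; the 8 EFA
    interfaces match the node group described in this PR.
    """
    limits = {
        NEURON_RESOURCE: neuron_devices,
        EFA_RESOURCE: efa_interfaces,
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pin the pod to the instance type used by the blueprint's node group.
            "nodeSelector": {"node.kubernetes.io/instance-type": "trn1.32xlarge"},
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Extended resources must have requests equal to limits.
                    "resources": {"limits": limits, "requests": dict(limits)},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    pod = build_trainium_pod("trn1-smoke-test", "example.com/hypothetical/trainer:latest")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -` once the plugins are running; the counts should be adjusted to whatever the target nodes actually advertise.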

Part 2

  • Deploys Volcano with TorchX, along with relevant examples.
  • Adds FSx for Lustre for dataset sharing purposes.
  • Updates the website documentation to include deployment and model training details.

Motivation

Part 1 introduces the Trainium blueprint with a trn1.32xlarge node group and 8 network interfaces, which lays the foundation for running larger workloads efficiently. The Karpenter configuration for Trainium improves resource management, which in turn improves cluster efficiency and cost. The two new local Helm charts for the Neuron plugin and the EFA plugin let users take advantage of the specialized features those plugins provide.

Part 2 deploys Volcano with TorchX, along with relevant examples, to streamline workload management and scheduling for complex computing tasks. FSx for Lustre adds a shared file system so that training jobs can access the same datasets. Finally, the updated website documentation covers deployment and model training in detail, which should make onboarding easier and reduce the learning curve. Together, these changes aim to improve the project's functionality, performance, and usability.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E test successfully completed before merge?

Additional Notes

@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 17:30 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 18:25 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 18:37 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 18:38 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 19:55 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 20:31 — with GitHub Actions Inactive
@vara-bonthu vara-bonthu temporarily deployed to DoEKS Test July 15, 2023 20:32 — with GitHub Actions Inactive
Contributor

@ovaleanu ovaleanu left a comment

LGTM. Brilliant!

@ovaleanu ovaleanu merged commit c54b857 into main Jul 17, 2023
@vara-bonthu vara-bonthu deleted the trainium-blueprint-part1 branch July 18, 2023 18:20
2 participants