[wip] add flyres stack #1104

Open
cosmicBboy wants to merge 2 commits into main from nielsb/flyres-stack

@cosmicBboy
Contributor

This pull request adds three example scripts to the flyres_stack reference stack, each demonstrating a machine learning workflow built with Flyte and its plugins. The examples cover image layering for ML environments, distributed training with Ray and Hugging Face model mounts, and PyTorch distributed training (DDP with FSDP-style orchestration) as an alternative to Megatron-LM. Together they serve as practical guides for setting up scalable, modular ML pipelines on Flyte.

New Example Scripts:

Efficient Image Build Strategies:

  • Added 01_image_build_strategy.py, demonstrating a two-layer image build approach: a slow-changing base image with PyTorch/CUDA and a faster-changing experimental layer. This minimizes rebuild times and enables rapid experimentation in ML workflows.
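
The caching idea behind the two-layer split can be sketched in plain Python. This is a simplified model, not Flyte's image builder and not the contents of 01_image_build_strategy.py: it assumes each layer's cache key is a hash of its own contents plus its parent's key. The `layer_key` helper and the package lists are hypothetical.

```python
import hashlib


def layer_key(parent_key: str, contents: list[str]) -> str:
    """Cache key for a layer: hash of the parent's key plus this layer's contents."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    for item in sorted(contents):
        digest.update(item.encode())
        digest.update(b"\0")  # separator so adjacent items cannot run together
    return digest.hexdigest()


# Hypothetical package lists, for illustration only.
base_packages = ["torch", "cuda-runtime"]      # slow-changing base layer
experiment_v1 = ["transformers"]               # fast-changing experimental layer
experiment_v2 = ["transformers", "datasets"]   # same layer after an edit

base_key = layer_key("", base_packages)
exp_key_v1 = layer_key(base_key, experiment_v1)
exp_key_v2 = layer_key(base_key, experiment_v2)

# Editing the experimental layer leaves the base key untouched,
# so only the top layer would need to be rebuilt.
base_reused = layer_key("", base_packages) == base_key
experiment_rebuilt = exp_key_v1 != exp_key_v2

print(f"base layer reused: {base_reused}")
print(f"experimental layer rebuilt: {experiment_rebuilt}")
```

Under this model, a change to the base layer would change every key above it, which is why the heavy, rarely edited dependencies sit at the bottom.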

Distributed Training with Ray:

  • Added 02_ray_distributed_training.py, showcasing distributed training using the Flyte-Ray plugin. The script integrates Hugging Face model mounts for shared data/model access, sets up a Ray cluster, and demonstrates distributed training, evaluation, and inference serving.

PyTorch FSDP Distributed Training:

  • Added 03_pytorch_fsdp_training.py, providing an example of distributed training using PyTorch’s DDP (and FSDP-style orchestration) via Flyte’s PyTorch plugin. This script serves as a Megatron-LM alternative for large model training, including synthetic dataset preparation, training, evaluation, and checkpointing.
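
For readers new to data-parallel training, the gradient averaging that DDP performs across workers can be illustrated without PyTorch. The sketch below is a pure-Python simulation, not the code in 03_pytorch_fsdp_training.py: the model, the data, and the `all_reduce_mean` helper are hypothetical stand-ins, and FSDP's parameter sharding is not shown.

```python
def mse_gradient(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)


# Synthetic data following y = 3x, split evenly across two simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]

w = 0.0
learning_rate = 0.01
for _ in range(200):
    # Each worker computes a gradient on its own shard only.
    local_grads = [mse_gradient(w, shard) for shard in shards]
    # Every worker then applies the same averaged gradient.
    w -= learning_rate * all_reduce_mean(local_grads)

print(f"learned w = {w:.4f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the simulated workers follow the same trajectory as single-process training.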

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>