
Commit 52b5b2d

Merge pull request #2021 from FedML-AI/alexleung/dev_branch_online

Alexleung/dev branch online

2 parents 4ea1c7c + 46cbab2

128 files changed: +13,024 −1,317 lines
Lines changed: 26 additions & 0 deletions (new file)

```yaml
workspace: "./src"

inference_image_name: "raphaeljin/fedml"
enable_custom_image: true

bootstrap: |
  echo "Bootstrap start..."
  pwd
  ls -l
  echo "Check shell script"
  cat fedml-deploy-bootstrap-entry-auto-gen.sh
  echo "Check main script"
  cat serve_main.py
  echo "Bootstrap finished"

## Simulate a successful deployment
#job: |
#  python3 serve_main.py

# Then during update, simulate a failed deployment
job: |
  echo "Simulate a failed deployment"
  exit 1

auto_detect_public_ip: true
use_gpu: true
```
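Since the `job` section above exits with a non-zero code on purpose, this config is useful for testing how the platform reports a failed deployment. As a quick sanity check, the file can be parsed before packaging; a minimal sketch, assuming the content above is saved as `config.yaml` and that PyYAML is installed (PyYAML is an illustration choice, not something this commit uses):

```python
# Hedged sketch: parse the deploy config above and confirm the multi-line
# sections survive as literal block scalars. Assumes `pip install pyyaml`.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# "bootstrap" and "job" were written with `|`, so they parse as multi-line
# shell-command strings rather than nested YAML.
print(config["inference_image_name"])  # raphaeljin/fedml
print(config["job"])                   # echo "Simulate a failed deployment"\nexit 1
print(config["use_gpu"])               # True
```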
Lines changed: 32 additions & 0 deletions (new file)

```python
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner
import uuid
import torch

# Calculate the number of elements
num_elements = 1_073_741_824 // 4  # using integer division for whole elements


class DummyPredictor(FedMLPredictor):
    def __init__(self):
        super().__init__()
        # Create a tensor with this many elements
        tensor = torch.empty(num_elements, dtype=torch.float32)

        # Move the tensor to GPU
        tensor_gpu = tensor.cuda()

        # for debug
        with open("/tmp/dummy_gpu_occupier.txt", "w") as f:
            f.write("GPU is occupied")

        self.worker_id = uuid.uuid4()

    def predict(self, request):
        return {f"AlohaV0From{self.worker_id}": request}


if __name__ == "__main__":
    predictor = DummyPredictor()
    fedml_inference_runner = FedMLInferenceRunner(predictor)
    fedml_inference_runner.run()
```
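The arithmetic in `num_elements` works out to 1,073,741,824 bytes / 4 bytes per `float32` = 268,435,456 elements, so the tensor created in `__init__` deliberately pins about 1 GiB of GPU memory for the lifetime of the worker. Once the runner is serving, a client could exercise `predict` roughly as follows; this is a hedged sketch, and the endpoint address is a hypothetical placeholder, since the real URL is assigned at deployment time:

```python
# Hedged sketch of calling the deployed predictor. The address below is a
# hypothetical placeholder, not an endpoint defined by this commit.
import requests

ENDPOINT = "http://127.0.0.1:2345/predict"  # hypothetical host/port/path

resp = requests.post(ENDPOINT, json={"text": "hello"})
# Per predict() above, the reply should look like:
# {"AlohaV0From<worker-uuid>": {"text": "hello"}}
print(resp.json())
```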
Lines changed: 9 additions & 0 deletions (new file)

```yaml
workspace: "./src"
entry_point: "serve_main.py"
bootstrap: |
  echo "Bootstrap start..."
  sleep 5
  echo "Bootstrap finished"

auto_detect_public_ip: true
use_gpu: true
```
Lines changed: 32 additions & 0 deletions (new file)

```python
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner
import uuid
import torch

# Calculate the number of elements
num_elements = 1_073_741_824 // 4  # using integer division for whole elements


class DummyPredictor(FedMLPredictor):
    def __init__(self):
        super().__init__()
        # Create a tensor with this many elements
        tensor = torch.empty(num_elements, dtype=torch.float32)

        # Move the tensor to GPU
        tensor_gpu = tensor.cuda()

        # for debug
        with open("/tmp/dummy_gpu_occupier.txt", "w") as f:
            f.write("GPU is occupied")

        self.worker_id = uuid.uuid4()

    def predict(self, request):
        return {f"AlohaV0From{self.worker_id}": request}


if __name__ == "__main__":
    predictor = DummyPredictor()
    fedml_inference_runner = FedMLInferenceRunner(predictor)
    fedml_inference_runner.run()
```

python/examples/deploy/dummy_job/config.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -2,5 +2,5 @@ workspace: "./src"
 entry_point: "serve_main.py"
 bootstrap: |
   echo "Bootstrap start..."
-  sleep 15
-  echo "Bootstrap finished"
+  sleep 5
+  echo "Bootstrap finished"
```

python/examples/deploy/dummy_job/config/bootstrap.sh

Lines changed: 0 additions & 12 deletions
This file was deleted.
Lines changed: 10 additions & 0 deletions (new file)

```shell
### don't modify this part ###
set -x
##############################

pip install -r requirements.txt
echo "Bootstrap finished."

### don't modify this part ###
exit 0
##############################
```
File renamed without changes.

python/examples/launch/train_build_package/train_job.yaml

Lines changed: 3 additions & 6 deletions

```diff
@@ -1,7 +1,7 @@
 # Local directory where your source code resides.
 # It should be the relative path to this job yaml file or the absolute path.
 # If your job doesn't contain any source code, it can be empty.
-workspace: .
+workspace: "./src"
 
 # Running entry commands which will be executed as the job entry point.
 # If an error occurs, you should exit with a non-zero code, e.g. exit 1.
@@ -14,14 +14,11 @@ job_type: train # options: train, deploy, federate
 
 # Bootstrap shell commands which will be executed before running entry commands.
 # Support multiple lines, which can be empty.
-bootstrap: |
-  echo "Bootstrap finished."
+bootstrap: bash bootstrap.sh
 
 computing:
   minimum_num_gpus: 1 # minimum # of GPUs to provision
   maximum_cost_per_hour: $3000 # max cost per hour for your job per gpu card
-  #allow_cross_cloud_resources: true # true, false
-  #device_type: CPU # options: GPU, CPU, hybrid
   resource_type: A100-80G # e.g., A100-80G, please check the resource type list by "fedml show-resource-type" or visiting URL: https://open.fedml.ai/accelerator_resource_type
 
 data_args:
@@ -36,4 +33,4 @@ model_args:
   output_dim: '10'
 
 training_params:
-  learning_rate: 0.004
+  learning_rate: 0.004
```
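With `bootstrap` now delegating to `bash bootstrap.sh`, the job is launched the same way as before. The README diff below installs the CLI with `pip3 install fedml` and runs `fedml launch job.yaml`; the same flow can be scripted, as in this minimal sketch (it assumes the `fedml` CLI is on `PATH` and the working directory contains `train_job.yaml`):

```python
# Hedged sketch: drive the FEDML Launch CLI from Python instead of the shell.
# Assumes `pip3 install fedml` has been run so the `fedml` command exists.
import subprocess

# Equivalent to running `fedml launch train_job.yaml` in this directory;
# check=True raises CalledProcessError if the launch command fails.
subprocess.run(["fedml", "launch", "train_job.yaml"], check=True)
```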

python/examples/train/llm_train/README.md

Lines changed: 52 additions & 41 deletions

````diff
@@ -2,12 +2,12 @@
 <img src="assets/fedml_logo_light_mode.png" width="400px" alt="FedML logo">
 </div>
 
-# LLM Fine-tune
+# LLM Training
 
 This repo contains an MLOps-supported training pipeline to help users build their own large language model (LLM) on proprietary/private
 data.
 This repo aims to provide a minimalist example of efficient LLM training/fine-tuning
-and to illustrate how to use FedML Launch and fine-tuning.
+and to illustrate how to use FEDML Launch.
 We leverage Pythia 7B by default and recently added support for Llama 2.
 
 The repo contains:
@@ -18,41 +18,16 @@ The repo contains:
 - Supports [DeepSpeed](https://www.deepspeed.ai/).
 - Dataset implementation with [datasets](https://huggingface.co/docs/datasets/index).
 
-## How to Use Llama 2
-
-Our example uses Pythia by default, but we recently added support for Llama2.
-If you'd like to use Llama2, please see the following instructions before getting started.
-
-To use [Llama 2](https://ai.meta.com/llama/), you need to apply access from Meta and request Meta's private
-Hugging Face repo access.
-
-1. Make sure your `transformers` version is `4.31.0` or newer. You could update your transformers via
-   `pip install --upgrade transformers`.
-2. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and apply for
-   access.
-3. Apply for [Meta's private repo](https://huggingface.co/meta-llama/Llama-2-7b-hf)
-   on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf). See below image for detail.
-   ![Meta's private repo on Hugging Face](assets/Llama/huggingface_llama_repo.png)
-4. Once both access are granted, you can start using Llama by passing `--model_name "meta-llama/Llama-2-7b-hf"` to the training script.
-
-> **Warning**
-> Since Llama 2 is on a private Hugging Face repo, you need to either login to Hugging Face or provide your access token.
-> - To login to huggingface (see https://huggingface.co/settings/tokens for detail), run `huggingface-cli login` in
->   command line.
-> - To pass an access token, you need to do one of the following:
->   - Set environment variable `HUGGING_FACE_HUB_TOKEN="<your access token>"`
->   - For centralized/conventional training, pass `--auth_token "<your access token>"` in the command line.
-
 ## Getting Started
 
 Clone the repo then go to the project directory:
 
 ```shell
 # clone the repo
-git clone https://github.com/FedML-AI/llm-finetune.git
+git clone https://github.com/FedML-AI/FedML.git
 
 # go to the project directory
-cd llm-finetune
+cd python/examples/train/llm_train
 ```
 
 Install dependencies with the following command:
@@ -63,7 +38,7 @@ pip install -r requirements.txt
 
 See [Dependencies](#dependencies) for more information on the dependency versions.
 
-### Conventional/Centralized Training
+### Training
 
 The [`run_train.py`](run_train.py) contains a minimal example for conventional/centralized LLM training and fine-tuning
 on [`databricks-dolly-15k`](https://huggingface.co/datasets/FedML/databricks-dolly-15k-niid) dataset.
@@ -84,6 +59,9 @@ bash scripts/train_deepspeed.sh \
 ... # additional arguments
 ```
 
+> **Note**
+> You can use `bash scripts/train.sh -h` to list all the supported CLI options.
+
 > **Note**
 > If you have an Ampere or newer GPU (e.g., RTX 3000 series or newer), you could turn on **bf16** to have more
 > efficient training by passing `--bf16 "True"` in the command line.
@@ -92,20 +70,53 @@ bash scripts/train_deepspeed.sh \
 > when using PyTorch DDP with LoRA and gradient checkpointing, you need to turn off `find_unused_parameters`
 > by passing `--ddp_find_unused_parameters "False"` in the command line.
 
+### Train with FEDML Launch
+
+If you have trouble finding computing resources, you can launch your training job via [FEDML Launch](https://doc.fedml.ai/launch) and let FEDML find the most cost-effective resources for your task.
+
+```shell
+# install fedml library
+pip3 install fedml
+
+# launch your training job
+fedml launch job.yaml
+```
+
+You can modify the training command in [job.yaml](job.yaml) by
+- specifying training settings in the `job` section
+- specifying environment setup settings in the `bootstrap` section
+- specifying compute resources in the `computing` section
+
+## How to Use Llama 2
+
+Our example uses Pythia by default, but we recently added support for Llama 2.
+If you'd like to use Llama 2, please see the following instructions before getting started.
+
+To use [Llama 2](https://ai.meta.com/llama/), you need to apply for access from Meta and request access to Meta's private
+Hugging Face repo.
+
+1. Make sure your `transformers` version is `4.31.0` or newer. You can update transformers via
+   `pip install --upgrade transformers`.
+2. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and apply for
+   access.
+3. Apply for [Meta's private repo](https://huggingface.co/meta-llama/Llama-2-7b-hf)
+   on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf). See the image below for detail.
+   ![Meta's private repo on Hugging Face](assets/Llama/huggingface_llama_repo.png)
+4. Once both accesses are granted, you can start using Llama by passing `--model_name "meta-llama/Llama-2-7b-hf"` to the training script.
+
+> **Warning**
+> Since Llama 2 is in a private Hugging Face repo, you need to either log in to Hugging Face or provide your access token.
+> - To log in to Hugging Face (see https://huggingface.co/settings/tokens for detail), run `huggingface-cli login` in the
+>   command line.
+> - To pass an access token, you need to do one of the following:
+>   - Set the environment variable `HUGGING_FACE_HUB_TOKEN="<your access token>"`
+>   - For centralized/conventional training, pass `--auth_token "<your access token>"` in the command line.
+
 ### Dependencies
 
 We have tested our implementation with the following setup:
 
 - Ubuntu `20.04.5 LTS` and `22.04.2 LTS`
 - CUDA `12.2`, `11.8`, `11.7` and `11.6`
-- Python `3.8.13` and `3.9.16`
-- `fedml>=0.8.4a7`
-- `torch>=2.0.0,<=2.1.0`
-- `torchvision>=0.15.1,<=0.16.0`
-- `transformers>=4.31.0,<=4.34.0`
-- `peft>=0.4.0,<=0.5.0`
-- `datasets>=2.11.0,<=2.14.5`
-- `deepspeed>=0.9.1,<=0.10.3`
-- `numpy>=1.24.3,<=1.24.4`
-- `tensorboard>=2.12.2,<=2.13.0`
-- `mpi4py>=3.1.4,<=3.1.5`
+- Python `3.8.13`, `3.9.16` and `3.10.13`
````
