Add a GPT-NeoX multi node example #195

Merged (7 commits) on Feb 28, 2023
1 change: 1 addition & 0 deletions .github/workflows/lint.yaml
@@ -29,6 +29,7 @@ jobs:
- resnet_imagenet
- stable_diffusion
- nemo
+ - gpt_neox
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
2 changes: 1 addition & 1 deletion README.md
@@ -20,7 +20,7 @@ pip install -e ".[llm]" # or pip install -e ".[llm-cpu]" if no NVIDIA GPU
cd examples/llm # cd into the specific example's folder
```

- Available examples include `llm`, `stable-diffusion`, `resnet-imagenet`, `resnet-cifar`, `bert`, `deeplab`, and `nemo`.
+ Available examples include `llm`, `stable-diffusion`, `resnet-imagenet`, `resnet-cifar`, `bert`, `deeplab`, `nemo`, and `gpt-neox`.

## Extending an example

43 changes: 43 additions & 0 deletions examples/gpt_neox/README.md
@@ -0,0 +1,43 @@
# GPT-NeoX on the Mosaic Platform

[The Mosaic platform](https://www.mosaicml.com/blog/mosaicml-cloud-demo) enables easy training of distributed machine learning (ML) jobs. In this folder, we provide an example of how to run [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), EleutherAI's library for training large language models, on the Mosaic platform.

You’ll find in this folder:

- `multi_node.yaml` - a yaml to run a multi-node GPT-NeoX training job on the Mosaic platform.

## Prerequisites

Here’s what you’ll need to get started with running GPT-NeoX on the Mosaic platform:

- A Docker image with the GPT-NeoX dependencies correctly installed (we tested with `shivanshupurohit/gpt-neox:112`).
- A dataset prepared in the [expected format](https://github.com/EleutherAI/gpt-neox/blob/72c80715c366cc4ad623050d6bcb984fe6638814/README.md?plain=1#L122).

## Starting Training
We include the `.yaml` file required to run multi-node GPT-NeoX on the Mosaic platform. You just need to fill in the `cluster` field in the `.yaml` file, and change the `data-path` if you are using your own data. If you are using Weights & Biases, fill in `wandb_project` and `wandb_team`; otherwise, remove the Weights & Biases related arguments (`use_wandb`, `wandb_project`, `wandb_team`, and `wandb_group`). The other GPT-NeoX configs can be modified as usual. The provided yaml file uses 16 GPUs; to use more, just change the `gpu_num` field. You will likely want to adjust the parallelism configuration for your exact setup. See the [GPT-NeoX README](https://github.com/EleutherAI/gpt-neox/blob/main/README.md) for more information.
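
For example, these are the fields at the top of the provided `multi_node.yaml` (shown in full below) that you would typically touch; `my-cluster` is just a placeholder for your own cluster name:

```
run_name: neox-multi-node
image: shivanshupurohit/gpt-neox:112  # Docker image provided by EleutherAI
gpu_num: 16          # increase this to train on more GPUs
cluster: my-cluster  # replace with the name of your cluster
```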

**Multi-Node Jobs**

Running a multi-node job is as simple as running `mcli run -f multi_node.yaml`.

The job emits a lot of logs, but early on you should see something like:

```
[2023-02-27 23:33:57,571] [INFO] [launch.py:82:main] WORLD INFO DICT: {'node-0': [0, 1, 2, 3, 4, 5, 6, 7], 'node-1': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-27 23:33:57,571] [INFO] [launch.py:88:main] nnodes=2, num_local_procs=8, node_rank=0
[2023-02-27 23:33:57,571] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'node-0': [0, 1, 2, 3, 4, 5, 6, 7], 'node-1': [8, 9, 10, 11, 12, 13, 14, 15]})
[2023-02-27 23:33:57,571] [INFO] [launch.py:104:main] dist_world_size=16
[2023-02-27 23:33:57,571] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```

and then, once training has started:

```
%comms: 4.818873821517855
%optimizer_step 0.7121011008320476
%forward: 19.83046088011732
%backward: 69.33287269883512
[2023-02-27 23:34:53,987] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 7.57 | forward: 519.08 | backward_microstep: 1815.10 | backward: 1814.87 | backward_inner_microstep: 1814.17 | backward_inner: 1813.91 | backward_allreduce_microstep: 0.35 | backward_allreduce: 0.12 | reduce_tied_grads: 0.33 | comms: 126.14 | reduce_grads: 125.80 | step: 18.64 | _step_clipping: 0.10 | _step_step: 17.40 | _step_zero_grad: 0.36 | _step_check_overflow: 0.21
[2023-02-27 23:34:56,819] [INFO] [logging.py:60:log_dist] [Rank 0] step=30, skipped=20, lr=[1.8749999999999998e-06, 1.8749999999999998e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 30 loss: 10.7133 iter time (s): 0.283 samples/sec: 226.457
```
72 changes: 72 additions & 0 deletions examples/gpt_neox/multi_node.yaml
@@ -0,0 +1,72 @@
run_name: neox-multi-node
image: shivanshupurohit/gpt-neox:112 # Docker image provided by EleutherAI
gpu_num: 16
cluster: # ADD YOUR CLUSTER HERE

integrations:
  - integration_type: git_repo
    git_repo: EleutherAI/gpt-neox
    git_commit: 72c80715c366cc4ad623050d6bcb984fe6638814 # main as of 02-27-2023
    path: /workspace/gpt-neox
  - integration_type: git_repo
    git_repo: EleutherAI/DeeperSpeed
    git_commit: 7069d10d2c9abac50576c84cb7e45910fafa218c # main as of 02-27-2023
    path: /workspace/DeeperSpeed

command: |
  # Install the requirements for GPT-NeoX
  cd /workspace/gpt-neox
  pip install -r requirements/requirements.txt

  # install EleutherAI's fork of deepspeed
  cd /workspace/DeeperSpeed
  pip install .

  # create a fake hostfile so that GPT-NeoX and DeepSpeed understand the cluster shape
  # Note: this assumes that all nodes have the same number of devices
  python -c '
  import os; \
  import torch; \
  filehandle = open("/tmp/deepspeed_mvapich_hostfile", "w"); \
  world_size = os.environ["WORLD_SIZE"]; \
  device_count = torch.cuda.device_count(); \
  num_nodes = int(world_size) // device_count; \
  _ = [filehandle.write(f"node-{node} slots={device_count}\\n") for node in range(num_nodes)]; \
  '
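  # For reference (illustrative, not something the script prints): with the default
  # gpu_num of 16 spread across two 8-GPU nodes, the hostfile ends up describing
  # two nodes, one "node-<rank> slots=8" entry per node, which matches the
  # WORLD INFO DICT shown in the README logs.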

  # create a GPT-NeoX config file for data paths, eval split, wandb setup, and launcher
  cd /workspace/gpt-neox/configs
  python -c '
  import json; \
  import os; \
  filehandle = open("extra-configs.yml", "w"); \
  values = { \
    "data-path": "data/enwik8/enwik8_text_document", \
    "use_shared_fs": False, \
    "vocab-file": "data/gpt2-vocab.json", \
    "merge-file": "data/gpt2-merges.txt", \
    "eval-interval": 100, \
    "eval-iters": 100, \
    "split": "949,50,1", \
    "use_wandb": True, \
    "wandb_project": <!!! your wandb project name here !!!>, \
    "wandb_team": <!!! your wandb team name here !!!>, \
    "wandb_group": os.environ["RUN_NAME"], \
    "launcher": "mosaicml" \
  }; \
  json.dump(values, filehandle); \
  '
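  # For reference (a sketch of the result): once the wandb placeholders above are
  # filled in, extra-configs.yml holds a single JSON object with exactly the keys
  # in `values`; GPT-NeoX merges it with configs/125M-json.yml in the training
  # command below.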

  cd /workspace/gpt-neox

  # download and prepare data
  # see https://github.com/EleutherAI/gpt-neox/blob/72c80715c366cc4ad623050d6bcb984fe6638814/README.md?plain=1#L122
  # for more details on the command
  python prepare_data.py enwik8 -d ./data
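  # (hint: prepare_data.py typically leaves Megatron-format .bin/.idx files under
  # ./data/enwik8/, which is what the data-path set in extra-configs.yml points at)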

  # run training
  # see https://github.com/EleutherAI/gpt-neox/blob/72c80715c366cc4ad623050d6bcb984fe6638814/README.md?plain=1#L216
  # for more details on the command
  # see https://github.com/EleutherAI/gpt-neox/blob/72c80715c366cc4ad623050d6bcb984fe6638814/README.md?plain=1#L112
  # for more details on configuration
  ./deepy.py train.py configs/125M-json.yml configs/extra-configs.yml --hostfile /tmp/deepspeed_mvapich_hostfile