[Examples] NeMo distributed training for BERT and GPT3 #2533

romilbhardwaj · 2023-09-08T21:29:46Z

Starter example showing how to run NVIDIA NeMo on SkyPilot for fine-tuning a BERT model on GLUE tasks and training a GPT-style model on a Wikipedia dataset.

Michaelvll

Nice! Great to see that we support NeMo out of the box. Left several minor comments. : )

Michaelvll · 2023-09-08T21:37:13Z

examples/nemo/nemo.yaml

+  if [ $? -eq 0 ]; then
+      echo "conda env exists"
+  else
+      conda create -y --name nemo python==3.10.12


I thought conda only accepts a single `=`? Also, is the patch version required?

Ah, this is from NeMo's official install instructions, but if you'd like to use just python=3.10, I can change it.

Ahh, I see. I remembered that `==` was not supported by conda, but I think it is fine to follow their official instructions if that works.
(The conda docs mention python=3.8.)
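
To make the `=` versus `==` question in this thread concrete: in conda's package match specification, a single `=` is a fuzzy (prefix) match, while `==` pins an exact version, so both forms are accepted. The sketch below is only an illustration. `conda_spec_kind` is a hypothetical helper, not part of conda or SkyPilot, and it classifies a spec string by its operator without calling conda.

```shell
# Hypothetical helper (not part of conda): report how the version operator
# in a spec such as "python=3.10" or "python==3.10.12" is interpreted.
conda_spec_kind() {
  case "$1" in
    *==*) echo "exact" ;;                    # python==3.10.12 pins 3.10.12
    *'<'*|*'>'*|*'!'*) echo "comparison" ;;  # e.g. python>=3.10
    *=*) echo "fuzzy" ;;                     # python=3.10 matches any 3.10.x
    *) echo "unconstrained" ;;               # python: the solver picks a version
  esac
}

conda_spec_kind "python==3.10.12"   # form used in the NeMo install instructions
conda_spec_kind "python=3.10"       # shorter form offered in the reply
```

Under this reading, switching to `python=3.10` would let the patch version float, which is the trade-off the thread is weighing.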

Michaelvll · 2023-09-08T21:39:35Z

examples/nemo/nemo.yaml

+
+      # Install nemo
+      sudo apt-get update
+      sudo apt-get install -y libsndfile1 ffmpeg


It seems we install ffmpeg, but the example trains on language tasks. Should we train on some CV tasks instead?

This is also from the official install instructions. I thought of leaving it in case people use NeMo for multi-modal tasks. Let me know if you want to remove it.

I would prefer to remove these to keep our setup minimal. I think we can have another example YAML for multi-modal tasks that includes these commands. Wdyt?

Michaelvll

Thanks for adding the example @romilbhardwaj! LGTM.

Michaelvll · 2023-09-09T20:12:13Z

examples/nemo/nemo_gpt3.yaml

+num_nodes: 2
+
+envs:
+  DATASET_ROOT: $HOME/wiki/


Does this $HOME work for remote clusters that do not have the same username as the local machine?

Yes, $HOME is expanded on the remote machine (i.e., this results in /home/sky/wiki/ on a k8s cluster rather than /Users/romilb/wiki/).
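
The behaviour described in this reply can be simulated locally. The following is a minimal sketch, not SkyPilot's actual mechanism: it assumes the env value reaches the remote shell as the literal string `$HOME/wiki/` and is expanded only there. `expand_on` is a hypothetical helper, and the two home directories are the ones quoted in the thread.

```shell
# The value as written in the YAML; single quotes keep $HOME unexpanded here.
DATASET_ROOT_TEMPLATE='$HOME/wiki/'

# Hypothetical helper: evaluate the template in a child shell whose HOME is
# set to the given directory, standing in for the machine that runs the task.
expand_on() {
  HOME="$1" sh -c "echo $DATASET_ROOT_TEMPLATE"
}

expand_on /home/sky        # stands in for the k8s pod: prints /home/sky/wiki/
expand_on /Users/romilb    # stands in for the laptop: prints /Users/romilb/wiki/
```

The point of the sketch is that the result depends on where the string is evaluated, not on where the YAML was written.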

romilbhardwaj · 2023-10-11T22:18:37Z

Added a note on using GCS when mounting the dataset bucket, since goofys fails with a "transport endpoint is not connected" error. Tested on GKE and GCP; merging now.

romilbhardwaj added 4 commits September 8, 2023 14:26

nemo

67d9950

newline

e0407d7

newline

5f17d14

add reference

672b6ed

Michaelvll reviewed Sep 8, 2023

romilbhardwaj added 2 commits September 8, 2023 15:19

nproc

6682677

Update docs

6923aa0

concretevitamin reviewed Sep 8, 2023

romilbhardwaj added 3 commits September 8, 2023 15:47

more launch

2a08b0b

add gpt3

02b248a

add gpt3

9f59474

romilbhardwaj changed the title ~~[Examples] NeMo distributed finetuning on GLUE~~ [Examples] NeMo distributed finetuning for BERT and GPT3 Sep 9, 2023

romilbhardwaj changed the title ~~[Examples] NeMo distributed finetuning for BERT and GPT3~~ [Examples] NeMo distributed training for BERT and GPT3 Sep 9, 2023

romilbhardwaj and others added 14 commits September 10, 2023 18:39

wp

ff9d5c9

trainonly working

d57d417

wip

fcf5ed5

add nemo distributed training and preprocessing scripts

524531f

add nemo distributed training and preprocessing scripts

bda066e

lint

cd44a2b

fixes

082a734

Use A100

cbcf38f

Add -s flag

33dabd6

fix params

5c7fe1e

update run time

225f2c7

add gsutil install

75900bf

Install gsutil

98720ca

Move conda activate to above gsutil

9b36014

romilbhardwaj mentioned this pull request Sep 29, 2023

[k8s] Multi-node support for Kubernetes #2609

Merged

7 tasks

force gcs. s3 doesn't work.

a8d7cbd

Michaelvll approved these changes Oct 3, 2023

romilbhardwaj added 5 commits October 5, 2023 12:20

rename

8b9ee4e

Merge branch 'master' into nemo_example

9b13a22

Merge branch 'nemo_example' of https://github.com/skypilot-org/skypilot…

1ff8b30

… into nemo_example

add hints

0f6699c

Add note on GCS

a68ae9a

romilbhardwaj merged commit 1eea3b8 into master Oct 11, 2023
18 checks passed

romilbhardwaj deleted the nemo_example branch October 11, 2023 22:18
