[Examples] NeMo distributed training for BERT and GPT3 #2533
Nice! Great to see that we support NeMo out of the box. Left several minor comments. : )
examples/nemo/nemo.yaml
Outdated
if [ $? -eq 0 ]; then
  echo "conda env exists"
else
  conda create -y --name nemo python==3.10.12
I thought conda should only accept a single `=`? Also, is the minor version required?
Ah, this is from NeMo's official install instructions, but if you'd like to use just `python=3.10`, I can change it.
Ahh, I see. I remembered that `==` was not supported by conda, but I think it is fine to follow their official instructions if that works. (The conda docs mention `python=3.8`.)
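For reference, conda's match-spec syntax accepts both forms: a single `=` is a fuzzy match (`python=3.10` matches any 3.10.x), while `==` is an exact match. The check-then-create step from the diff could be sketched as below. This is only a sketch: the command that precedes the `$?` check is not shown in the diff, and both `env_in_listing` and `ensure_env` are hypothetical helpers that parse the text printed by `conda env list`.

```shell
# Hypothetical helper: succeed if the env name in $1 appears as the first
# column of `conda env list` output supplied on stdin.
env_in_listing() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Hypothetical helper: create env $1 with package spec $2 only if it is missing.
# Assumes conda is on PATH when called.
ensure_env() {
  if conda env list | env_in_listing "$1"; then
    echo "conda env exists"
  else
    conda create -y --name "$1" "$2"
  fi
}
```

Calling `ensure_env nemo python=3.10` would use the looser pin, and `ensure_env nemo python==3.10.12` would keep the exact version from NeMo's instructions.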
examples/nemo/nemo.yaml
Outdated
# Install nemo
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg
It seems we install `ffmpeg`, but we are training on language tasks. Should we train on some CV tasks instead?
This is also from the official install instructions. I thought of leaving it in, in case people use NeMo for multi-modal tasks. Let me know if you want to remove it.
I would prefer to remove these to keep our setup minimal. I think we can have another example YAML for multi-modal tasks that includes these commands. Wdyt?
Thanks for adding the example @romilbhardwaj! LGTM.
examples/nemo/nemo_gpt3.yaml
Outdated
num_nodes: 2

envs:
  DATASET_ROOT: $HOME/wiki/
Does this `$HOME` work for remote clusters that do not have the same username as the local machine?
Yes, the `$HOME` is expanded on the remote machine (i.e., this results in `/home/sky/wiki/` on a k8s cluster rather than `/Users/romilb/wiki`).
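The deferred expansion described above can be simulated locally. This is a minimal sketch, not SkyPilot's actual mechanism: it keeps the value as a literal, unexpanded string and lets a child shell expand it under different `HOME` values. The variable `dataset_root_template` and the function `expand_with_home` are hypothetical names used only for illustration.

```shell
# Literal value as written in the YAML; single quotes prevent local expansion.
dataset_root_template='$HOME/wiki/'

# Expand the template in a child shell whose HOME is set to $1, mimicking
# a machine where the user has a different home directory.
expand_with_home() {
  HOME="$1" sh -c "printf '%s\n' \"$dataset_root_template\""
}

local_path=$(expand_with_home /Users/romilb)
remote_path=$(expand_with_home /home/sky)

echo "local:  $local_path"
echo "remote: $remote_path"
```

The same string resolves to a different directory depending on which machine's shell evaluates it, which is why the YAML value does not depend on the local username.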
Added a note on using GCS when mounting the dataset bucket, since goofys fails with a "transport endpoint is not connected" error. Tested on GKE and GCP; merging now.
Starter example showing how to run NVIDIA NeMo on SkyPilot for fine-tuning a BERT model on GLUE tasks and training a GPT-style model on a Wikipedia dataset.