NCCL when trying to train on 2 nodes #19544
Comments
Did you set a different NODE_RANK on each node?
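For reference, a minimal sketch of what that looks like when launching the script manually on each node (the address, port, and device count below are placeholders, not values from this issue; the variable names are the standard torch.distributed ones that Lightning picks up):

import os

from lightning import Trainer

# Every node must see the same rendezvous address/port but a different NODE_RANK.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # placeholder: IP of node 0
os.environ.setdefault("MASTER_PORT", "29500")     # placeholder: any free port
os.environ.setdefault("NODE_RANK", "0")           # 0 on the first node, 1 on the second

trainer = Trainer(
    accelerator="gpu",
    devices=4,        # placeholder: GPUs per node
    num_nodes=2,
    strategy="ddp",
)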
@p208p2002 can you share how you set up Azure to work with Lightning across multiple nodes? I have Azure working on multiple GPUs on a single node with DDP, but not across multiple nodes. Would love to see how you did it!
Sure, but please note that what I use is a compute cluster from Azure ML Studio. First, create the compute cluster under AML, then create a project-specific environment; it's recommended to reference DeepSpeed's official Docker image. Next, write a job submission script with the AML Python SDK; through the SDK you can specify which runtime to use and how to start the training program. I can show you part of my job submission script:

from azure.ai.ml import command, Output, MLClient, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
import os
...
job = command(
    display_name=...,
code="./", # local path where the code is stored
environment_variables={
"TRAINING_CONFIG": TRAINING_CONFIG,
"WANDB_API_KEY": WANDB_API_KEY,
"HF_TOKEN": HF_TOKEN,
},
command="python sft_trainer.py --num_nodes ${{inputs.num_nodes}} --gpus_per_node ${{inputs.gpus_per_node}} --log_dir ${{outputs.log_dir}} fit",
inputs=inputs,
outputs=outputs,
environment=os.environ["AZURE_ENVIRONMENT"],
compute=os.environ["AZURE_COUPUTE"],
instance_count=inputs["num_nodes"],
distribution={
"type": "PyTorch",
"process_count_per_instance": inputs["gpus_per_node"],
},
)
# submit the command
ml_client.jobs.create_or_update(job)

The code above is not complete; you should finish it yourself. You can see that in command I pass the args --num_nodes and --gpus_per_node down to the training script. Finally, make a small modification to the Lightning trainer:

# trainer
trainer = Trainer(
num_nodes=args.num_nodes,
devices=args.gpus_per_node,
...
)

The last thing you should know is how the distributed training system works. In short, the system provides some environment variables to identify the master and worker nodes so that they can communicate with each other. This article may help: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2
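Putting the two halves together, here is a sketch of what the receiving side could look like. This is not @p208p2002's actual sft_trainer.py; the argument names follow the command string above, and the environment variable names are the standard ones the AML "PyTorch" distribution provides to each process:

# sft_trainer.py (sketch) -- consumes the args passed by the AML command above
import argparse
import os

from lightning import Trainer


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_nodes", type=int, default=1)
    parser.add_argument("--gpus_per_node", type=int, default=1)
    parser.add_argument("--log_dir", type=str, default="./logs")
    parser.add_argument("subcommand", choices=["fit"])
    args = parser.parse_args()

    # The cluster runtime sets the usual torch.distributed variables on every
    # process; Lightning reads them to tell the nodes apart.
    print("MASTER_ADDR:", os.environ.get("MASTER_ADDR"))
    print("MASTER_PORT:", os.environ.get("MASTER_PORT"))
    print("NODE_RANK:", os.environ.get("NODE_RANK"))
    print("WORLD_SIZE:", os.environ.get("WORLD_SIZE"))

    trainer = Trainer(
        num_nodes=args.num_nodes,
        devices=args.gpus_per_node,
        accelerator="gpu",
        strategy="ddp",
        default_root_dir=args.log_dir,
    )
    # trainer.fit(model, datamodule) goes here


if __name__ == "__main__":
    main()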
Thanks @p208p2002, I got it all working on my cluster. I haven't experimented with DeepSpeed yet, but that looks like an interesting avenue for speed-ups. Not sure my model is large enough to warrant it though (millions, not billions, of params for mobile).
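If you do want to try DeepSpeed later, enabling it in Lightning is mostly a one-line change via the built-in strategy names. A sketch, with placeholder device counts; the stage and precision settings are examples, not a recommendation for your model size:

from lightning import Trainer

# ZeRO stage 2 shards optimizer states and gradients across GPUs;
# for models in the millions of parameters, plain "ddp" is usually enough.
trainer = Trainer(
    accelerator="gpu",
    devices=4,          # placeholder: GPUs per node
    num_nodes=2,
    strategy="deepspeed_stage_2",
    precision="16-mixed",
)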
Bug description
I am trying to run a very simple training script on 2 nodes and I always get this error:
Output:
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response