Support MUSA (Moore Threads GPU) backend in accelerate #2917

Merged: 1 commit, Jul 10, 2024

Conversation

@fmo-mt (Contributor) commented Jul 5, 2024

What does this PR do?

To train 🤗 Transformers models on MUSA (Moore Threads GPU), support must first be added to Accelerate; Trainer support then comes for free.

This PR adds MUSA support to Accelerate, following the same approach used for MLU support (#2552).
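
For context, the core of such a backend integration is an availability check plus device bookkeeping. Below is a minimal sketch of what the check might look like; the helper name and exact calls are illustrative, not necessarily what this PR adds:

```python
# Hypothetical sketch of a MUSA availability check, mirroring how other
# accelerator backends are typically detected. Assumes the torch_musa
# extension registers a "musa" device with PyTorch when imported.
import importlib.util


def is_musa_available() -> bool:
    """Return True if torch_musa is installed and a MUSA device is usable."""
    if importlib.util.find_spec("torch_musa") is None:
        return False
    import torch
    import torch_musa  # noqa: F401  # the import registers the backend

    return hasattr(torch, "musa") and torch.musa.is_available()
```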

1. Sample config after running the `accelerate config` command:

   ```yaml
   compute_environment: LOCAL_MACHINE
   debug: false
   distributed_type: MULTI_MUSA
   downcast_bf16: 'no'
   gpu_ids: 0,1,2,3,4,5,6,7
   machine_rank: 0
   main_training_function: main
   mixed_precision: 'no'
   num_machines: 1
   num_processes: 8
   rdzv_backend: static
   same_network: true
   tpu_env: []
   tpu_use_cluster: false
   tpu_use_sudo: false
   use_cpu: false
   ```
2. To train a bert-large-uncased model (a minimal usage sketch follows after this list):

   ```bash
   accelerate launch run_trainer.py \
       --model_name_or_path ./squad_finetuned_checkpoint \
       --dataset_name ./squad \
       --per_device_train_batch_size 24 \
       --learning_rate 3e-5 \
       --num_train_epochs 50 \
       --max_seq_length 384 \
       --doc_stride 128 \
       --lr_scheduler_type cosine \
       --output_dir ./bert-large-uncased |& tee bert-large-uncased.log
   ```

   Below are the output logs:

   ```text
   loading file vocab.txt
   loading file tokenizer.json
   loading file added_tokens.json
   loading file special_tokens_map.json
   loading file tokenizer_config.json
   loading configuration file ./squad_finetuned_checkpoint/config.json
   Model config BertConfig {
     "_name_or_path": "./squad_finetuned_checkpoint",
     "architectures": [
       "BertForQuestionAnswering"
     ],
     "attention_probs_dropout_prob": 0.1,
     "classifier_dropout": null,
     "hidden_act": "gelu",
     "hidden_dropout_prob": 0.1,
     "hidden_size": 1024,
     "initializer_range": 0.02,
     "intermediate_size": 4096,
     "layer_norm_eps": 1e-12,
     "max_position_embeddings": 512,
     "model_type": "bert",
     "num_attention_heads": 16,
     "num_hidden_layers": 24,
     "pad_token_id": 0,
     "position_embedding_type": "absolute",
     "transformers_version": "4.40.0",
     "type_vocab_size": 2,
     "use_cache": true,
     "vocab_size": 30522
   }
   07/05/2024 16:07:43 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
   [previous line repeated 6 more times, one per worker process]
   loss: 4.95459, lr: [2.999990586959975e-05, 2.999990586959975e-05]:   0%|          | 26/23100 [01:11<13:08:05,  2.05s/it]
   ```
3. About MUSA and Moore Threads GPU:
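
As a follow-up to the training example above: once Accelerate recognizes the backend, a standard `Accelerator`-based script should pick up the MUSA device without code changes. A minimal sketch, with illustrative model and hyperparameters, assuming torch_musa is installed:

```python
# Minimal sketch: Accelerator detects the available backend and moves the
# model/optimizer to it, so the training script stays device-agnostic.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(1024, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model, optimizer = accelerator.prepare(model, optimizer)
print(accelerator.device)  # expected: e.g. "musa:0" on Moore Threads hardware
```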

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fmo-mt (Contributor, Author) commented Jul 8, 2024

@muellerzr @SunMarc Hi, buddies! Can you take a look at this PR, please?

@SunMarc (Member) left a comment


Thanks for the PR @fmo-mt! The integration looks very clean! Nice to see a new backend 🔥 Can you have a second look, @muellerzr? Also, I'm not sure if you are on the team working on torch_musa, but if that's the case, it would be great to spin up some runners on your side to make sure that we don't have failing Accelerate tests on MUSA hardware.

@muellerzr (Collaborator) left a comment


Same comment as Marc; very nice PR @fmo-mt!

@muellerzr (Collaborator) commented

For the quality check to pass, please run `pip install -e .[quality]; make style; make quality`.

@fmo-mt (Contributor, Author) commented Jul 9, 2024

> Thanks for the PR @fmo-mt! The integration looks very clean! Nice to see a new backend 🔥 Can you have a second look, @muellerzr? Also, I'm not sure if you are on the team working on torch_musa, but if that's the case, it would be great to spin up some runners on your side to make sure that we don't have failing Accelerate tests on MUSA hardware.

Yes, I'm currently working on torch_musa, and we have trained/fine-tuned models such as BERT and Mistral.

@fmo-mt (Contributor, Author) commented Jul 9, 2024

@muellerzr @SunMarc Oh, I fixed a typo and force-pushed with a rebase, which cleaned up the change history, but it seems the CI workflow needs to be activated by you 🥲

@SunMarc (Member) commented Jul 10, 2024

No issues! I'm merging!
