System Info
- Hardware used: NVIDIA A6000 48G, A100 80G
- Base models used: Mistral-7B-v0.3, Llama-3.0/1-8B
- transformers: 4.43.3
- accelerate: 0.32.0
- deepspeed: 0.14.4
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
resize_token_embeddings still does not work in version 4.43.3. Although PR #32214 resolved the issue, the 4.43.3 patch release does not actually include that PR, even though it is mentioned in the patch notes. I confirmed that the test scripts below still fail on 4.43.3 but work on the main branch, which includes the PR (see the workaround note after the table).
| Relevant PR | Issue Resolved | Mentioned in Patch Notes | Actually Included in Patch |
|---|---|---|---|
| #32192 | ✘ | ✔️ (4.43.2) | ✔️ (4.43.2) |
| #32214 | ✔️ | ✔️ (4.43.3) | ✘ (4.43.3; comparison link (vs 4.43.2)) |
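Until a patch release actually includes #32214, a possible workaround (simply what worked for me when verifying the fix on main) is to install transformers from source, e.g. pip install git+https://github.com/huggingface/transformers.git.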
If I resize the token embeddings to a size greater than or equal to the original vocab size, vocab_size is set to zero. If I resize to a smaller size instead, a different error occurs: RuntimeError: start (0) + length (525336576) exceeds dimension size (524943360).
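For reference (my own arithmetic, not part of the traceback): 525336576 = 128256 × 4096 and 524943360 = 128160 × 4096, so both lengths correspond to full embedding matrices with hidden_size 4096 (128160 is, for example, what 128256 − 100 rounds up to with pad_to_multiple_of=8). This suggests the failing step is copying the original 128256-row embedding into the smaller resized tensor.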
test.sh:
CUDA_VISIBLE_DEVICES=0 accelerate launch \
--mixed_precision bf16 \
--num_machines 1 \
--num_processes 1 \
--use_deepspeed \
--deepspeed_config_file test_ds_config.conf \
test.py
test.py:
from transformers import AutoModelForCausalLM
from accelerate import Accelerator
accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
print(f"Model Config 1: {model.config}")
model.resize_token_embeddings(model.vocab_size + 100, pad_to_multiple_of=8)
print(f"Model Config 2: {model.config}")
test_ds_config.conf:
{
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 1e5,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
output:
Model Config 1: LlamaConfig {
"_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.43.3",
"use_cache": true,
"vocab_size": 128256
}
Model Config 2: LlamaConfig {
"_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.43.3",
"use_cache": true,
"vocab_size": 0
}
Expected behavior
resize_token_embeddings should correctly update vocab_size (here 128256 + 100 = 128356, padded to a multiple of 8, i.e. 128360) instead of setting it to 0 or raising an error.
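For comparison, here is a minimal sketch of the same resize run as a plain single-process Python script without the accelerate/DeepSpeed launcher, assuming access to the same checkpoint; the file name and the expected values in the comments are mine:
check_resize.py:
from transformers import AutoModelForCausalLM

# Load the same model as in test.py, but without Accelerator/DeepSpeed ZeRO-3.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
old_size = model.config.vocab_size                    # 128256 for Llama-3.1-8B
model.resize_token_embeddings(old_size + 100, pad_to_multiple_of=8)
# 128256 + 100 = 128356, rounded up to the next multiple of 8 -> 128360
print(model.config.vocab_size)                        # expected: 128360
print(model.get_input_embeddings().weight.shape[0])   # expected: 128360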