
Resize embeds (with DeepSpeed) is still not fixed in version 4.43.3 #32287

@seokhyunan

Description

System Info

  • Hardware used: NVIDIA A6000 48G, A100 80G
  • Base models used: Mistral-7B-v0.3, Llama-3-8B, Llama-3.1-8B
  • transformers: 4.43.3
  • accelerate: 0.32.0
  • deepspeed: 0.14.4

Who can help?

@ArthurZucker @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

resize_token_embeddings still does not work under DeepSpeed ZeRO-3 in version 4.43.3. Although PR #32214 resolved the issue, the 4.43.3 release does not actually contain that PR, even though it is listed in the patch notes. I confirmed that the test scripts below still fail on 4.43.3 but work on the main branch, which includes the PR.

| Relevant PR | Issue Resolved | Mentioned in Patch Notes | Actually Included in Patch |
| --- | --- | --- | --- |
| #32192 | ✔️ | ✔️ (4.43.2) | ✔️ (4.43.2) |
| #32214 | ✔️ | ✔️ (4.43.3) | ✘ (4.43.3; see the comparison link vs 4.43.2) |

If I resize the token embeddings to a size greater than or equal to the original vocab size, vocab_size is set to zero. If I resize to a smaller size, a different error occurs instead: RuntimeError: start (0) + length (525336576) exceeds dimension size (524943360).
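To summarize the two failure modes as code (a sketch; the +100 / -100 deltas are arbitrary examples, and this assumes the same ZeRO-3 launch as test.sh below):

# Case 1: new size >= original vocab size -> config.vocab_size is silently set to 0
model.resize_token_embeddings(model.vocab_size + 100, pad_to_multiple_of=8)

# Case 2: new size < original vocab size (hypothetical delta) -> crashes with
# "RuntimeError: start (0) + length (...) exceeds dimension size (...)"
model.resize_token_embeddings(model.vocab_size - 100)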

test.sh:

CUDA_VISIBLE_DEVICES=0 accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes 1 \
    --use_deepspeed \
    --deepspeed_config_file test_ds_config.conf \
    test.py

test.py:

from transformers import AutoModelForCausalLM
from accelerate import Accelerator

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

print(f"Model Config 1: {model.config}")
model.resize_token_embeddings(model.vocab_size + 100, pad_to_multiple_of=8)
print(f"Model Config 2: {model.config}")

test_ds_config.conf:

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

output:

Model Config 1: LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

Model Config 2: LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 0
}

Expected behavior

resize_token_embeddings should correctly update vocab_size (and resize the embedding matrices) under DeepSpeed ZeRO-3, as it does on the current main branch.
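Until a patch release actually ships PR #32214, one possible workaround (consistent with the repro above working on main) is to install transformers from source:

pip install git+https://github.com/huggingface/transformers.git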
