
Resize embeds (with DeepSpeed) is still not fixed in version 4.43.3 #32287

@seokhyunan

Description

System Info

  • Hardware used: NVIDIA A6000 48G, A100 80G
  • Base models used: Mistral-7B-v0.3, Llama-3-8B, Llama-3.1-8B
  • transformers: 4.43.3
  • accelerate: 0.32.0
  • deepspeed: 0.14.4

Who can help?

@ArthurZucker @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

resize_token_embeddings still does not work under DeepSpeed ZeRO-3 in version 4.43.3. Although PR #32214 resolved the issue, the 4.43.3 release does not actually contain that PR, even though it is listed in the patch notes. I confirmed that the test scripts below still fail on 4.43.3 but work on the main branch, which includes the PR.

| Relevant PR | Issue Resolved | Mentioned in Patch Notes | Actually Included in Patch |
| --- | --- | --- | --- |
| #32192 | ✔️ | ✔️ (4.43.2) | ✔️ (4.43.2) |
| #32214 | ✔️ | ✔️ (4.43.3) | ✘ (4.43.3; see the comparison link vs 4.43.2) |

If I resize the token embeddings to a size greater than or equal to the original vocab size, vocab_size is set to zero. If I resize to a smaller size, a different error occurs instead: RuntimeError: start (0) + length (525336576) exceeds dimension size (524943360).
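To summarize the two failure modes as code (a sketch; the +100 / -100 deltas are arbitrary examples, and this assumes the same ZeRO-3 launch as test.sh below):

# Case 1: new size >= original vocab size -> config.vocab_size is silently set to 0
model.resize_token_embeddings(model.vocab_size + 100, pad_to_multiple_of=8)

# Case 2: new size < original vocab size (hypothetical delta) -> crashes with
# "RuntimeError: start (0) + length (...) exceeds dimension size (...)"
model.resize_token_embeddings(model.vocab_size - 100)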

test.sh:

CUDA_VISIBLE_DEVICES=0 accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes 1 \
    --use_deepspeed \
    --deepspeed_config_file test_ds_config.conf \
    test.py

test.py:

from transformers import AutoModelForCausalLM
from accelerate import Accelerator

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

print(f"Model Config 1: {model.config}")
model.resize_token_embeddings(model.vocab_size + 100, pad_to_multiple_of=8)
print(f"Model Config 2: {model.config}")

test_ds_config.conf:

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

output:

Model Config 1: LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

Model Config 2: LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 0
}

Expected behavior

resize_token_embeddings should correctly update vocab_size (and resize the embedding matrices) under DeepSpeed ZeRO-3, as it does on the current main branch.
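Until a patch release actually ships PR #32214, one possible workaround (consistent with the repro above working on main) is to install transformers from source:

pip install git+https://github.com/huggingface/transformers.git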
