
enable autoTP for MPT #3861

Merged
merged 5 commits into microsoft:master on Jul 27, 2023

Conversation

@sywangyi (Contributor) commented Jul 3, 2023

No description provided.

@sywangyi (Contributor, Author) commented Jul 3, 2023

@yao-matrix @delock please help review

@delock (Collaborator) commented Jul 3, 2023

Hi @tjruwase, is @molly-smith the owner of AutoTP-related changes? There are contributions to enable AutoTP for different LLM models; should we invite Molly to review?

@delock (Collaborator) commented Jul 3, 2023

@sywangyi can you also add MPT to the supported models in the docs? https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/automatic-tensor-parallelism.md

@sywangyi (Contributor, Author) commented Jul 3, 2023

> @sywangyi can you also add MPT to the supported models in the docs? https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/automatic-tensor-parallelism.md

Done.

@tjruwase (Contributor) commented Jul 3, 2023

> Hi @tjruwase, is @molly-smith the owner of AutoTP-related changes? There are contributions to enable AutoTP for different LLM models; should we invite Molly to review?

@delock, sure, I will ask Molly for help. Thanks!

@tjruwase (Contributor) commented Jul 3, 2023

> > Hi @tjruwase, is @molly-smith the owner of AutoTP-related changes? There are contributions to enable AutoTP for different LLM models; should we invite Molly to review?
>
> @delock, sure, I will ask Molly for help. Thanks!

@delock, July 4th is this week, so there might be some delay with reviews as team members take some time off. Apologies for the inconvenience.

@delock (Collaborator) commented Jul 4, 2023

> @delock, July 4th is this week, so there might be some delay with reviews as team members take some time off. Apologies for the inconvenience.

Got it, thanks for the reminder!

@molly-smith (Contributor) commented

@tjruwase @sywangyi @delock AutoTP creates a dictionary of which model layers require all reduce and is implemented in deepspeed/module_inject/auto_tp.py. This part is actually already functional for MPT. However, I assume the lower level implementation of tensor parallelism is not functional for MPT. I am not the code owner for this part of TP. I think @RezaYazdaniAminabadi is the right person to review these changes.
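
(Side note for readers following the thread: below is a minimal conceptual sketch of the idea described above, i.e. scanning a model and recording which linear layers would need an all-reduce after tensor-parallel sharding. This is not DeepSpeed's actual auto_tp.py code; the name patterns and helper function are hypothetical.)

# Conceptual sketch only, not DeepSpeed's auto_tp.py: walk a model and record which
# nn.Linear layers would need an all-reduce after TP sharding (typically the
# attention/MLP output projections). Name patterns and helper are hypothetical.
import torch.nn as nn

OUTPUT_PROJ_PATTERNS = ("out_proj", "down_proj")  # MPT-style output projections

def find_allreduce_layers(model: nn.Module) -> dict:
    """Map layer name -> True if its output must be all-reduced across TP ranks."""
    return {
        name: any(p in name for p in OUTPUT_PROJ_PATTERNS)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    }

# Tiny stand-in block with MPT-like attribute names.
block = nn.ModuleDict({
    "Wqkv": nn.Linear(64, 192),
    "out_proj": nn.Linear(64, 64),
    "up_proj": nn.Linear(64, 256),
    "down_proj": nn.Linear(256, 64),
})
print(find_allreduce_layers(block))
# {'Wqkv': False, 'out_proj': True, 'up_proj': False, 'down_proj': True}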

@tjruwase (Contributor) commented Jul 5, 2023

@molly-smith, thanks for the clarification.

@RezaYazdaniAminabadi, can you please help review this?

tjruwase added the merge-queue label Jul 10, 2023
Review thread on deepspeed/inference/engine.py (outdated, resolved)
mrwyattii removed the merge-queue label Jul 10, 2023

@minionKP commented
Hi @sywangyi, I was trying out deepspeed for MPT30B as well and saw you had added the changes.

I just tried out your changes by building DeepSpeed from source: cloning the repository with your changes and running pip install .

# tried with both replace_with_kernel_inject as True and False
ds_engine = deepspeed.init_inference(model, mp_size=4, dtype=torch.float16, replace_with_kernel_inject=True)
model = ds_engine.module
# evaluate model

I am getting the following error when replace_with_kernel_inject is False

a00f1a5307c5ca947e72e89693c/attention.py:41 in scaled_multihead_dot_product_attention

   38         attn_bias = attn_bias[:, :, _s_q:, _s_k:]
   39         if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2)
   40             raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to bro
❱  41         attn_weight = attn_weight + attn_bias
   42     min_val = torch.finfo(q.dtype).min
   43     if key_padding_mask is not None:
   44         if attn_bias is not None:

RuntimeError: The size of tensor a (16) must match the size of tensor b (64) at non-singleton dimension 1

and without it I am getting OOM on a machine with 4x A100 40GB.

Anything I should be doing differently?

@sywangyi (Contributor, Author) commented

> Hi @sywangyi, I was trying out deepspeed for MPT30B as well and saw you had added the changes. [...] Anything I should be doing differently?

I could run MPT-30B with TP=4 successfully on 4x RTX 8000 using AutoTP, so it seems the attn_bias is incorrect on your side. Could you check whether https://github.com/microsoft/DeepSpeed/pull/3861/files#diff-27be33d45da8a29a59628046c212ecbdb630e85a9bd987a98431272f1472a3fbR63 is called correctly?
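
For reference, a minimal sketch of the AutoTP launch path being discussed, using the API of this era (mp_size, replace_with_kernel_inject); the model name, prompt, script name, and GPU count are illustrative, not prescriptive:

# Launch: deepspeed --num_gpus 4 run_mpt_autotp.py   (script name is illustrative)
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-30b"  # any MPT checkpoint; requires trust_remote_code

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, trust_remote_code=True
)

# replace_with_kernel_inject=False selects the AutoTP path this PR enables for MPT.
model = deepspeed.init_inference(
    model,
    mp_size=4,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
).module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))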

@minionKP commented

> I could run MPT-30B with TP=4 successfully on 4x RTX 8000 using AutoTP, so it seems the attn_bias is incorrect on your side. Could you check whether https://github.com/microsoft/DeepSpeed/pull/3861/files#diff-27be33d45da8a29a59628046c212ecbdb630e85a9bd987a98431272f1472a3fbR63 is called correctly?

I made a mistake in my setup with your changes. I can confirm it works now. Thanks!

@sywangyi (Contributor, Author) commented Jul 27, 2023

@mrwyattii I have moved the model-specific ops into a separate file. Could you review the change?
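
For readers, the kind of model-specific helper this refers to is, roughly, slicing MPT's per-head attention bias down to the heads owned by the local tensor-parallel rank. The sketch below is illustrative only; the function name and signature are hypothetical, not DeepSpeed's exact API:

# Illustrative sketch: when attention heads are split across TP ranks, the per-head
# attn_bias must be sliced to the local heads, otherwise the rank-local attn_weight
# (n_heads / world_size heads) no longer broadcasts against a full-size bias --
# exactly the 16-vs-64 mismatch reported above.
import torch
import torch.distributed as dist

def slice_attn_bias_for_rank(attn_bias: torch.Tensor) -> torch.Tensor:
    """Return the slice of a [1, n_heads, s_q, s_k] bias owned by this TP rank."""
    if not dist.is_initialized():
        return attn_bias
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    heads_per_rank = attn_bias.shape[1] // world_size
    start = rank * heads_per_rank
    return attn_bias[:, start:start + heads_per_rank]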

@mrwyattii mrwyattii enabled auto-merge July 27, 2023 20:17
@mrwyattii mrwyattii added this pull request to the merge queue Jul 27, 2023
Merged via the queue into microsoft:master with commit 0bafeac Jul 27, 2023
polisettyvarma pushed a commit to polisettyvarma/DeepSpeed that referenced this pull request Aug 7, 2023
* enable autoTP for MPT

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add model specific func to auto_tp_model_utils.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>