Adding the new feature of FPDT (#441) #11
Conversation
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1
* add FPDT support; add Ulysses rotary position embedding support
* remove unnecessary files
* set the warmup length to be FPDT chunk size if enabled

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Reviewer's Guide by Sourcery

This PR implements FPDT (Fully Pipelined Distributed Transformer) support and Ulysses rotary position embedding in the transformer model. The main changes include adding FPDT functionality for sequence-parallel processing, modifying the attention mechanism to support FPDT, and updating the rotary position embedding implementation.

Sequence diagram for FPDT in the Transformer forward pass

sequenceDiagram
participant Transformer
participant FPDT_FFN
participant DenseHTo4H
participant ActivationFunc
participant Dense4HToH
Transformer->>FPDT_FFN: apply(hidden_states, dense_h_to_4h.weight, dense_h_to_4h.bias, dense_4h_to_h.weight, dense_4h_to_h.bias, add_bias, fpdt_FFN_chunk_size)
alt FPDT not enabled
Transformer->>DenseHTo4H: dense_h_to_4h(hidden_states)
DenseHTo4H-->>Transformer: intermediate_parallel, bias_parallel
Transformer->>ActivationFunc: activation_func(intermediate_parallel)
ActivationFunc-->>Transformer: intermediate_parallel
Transformer->>Dense4HToH: dense_4h_to_h(intermediate_parallel)
Dense4HToH-->>Transformer: output, output_bias
end
    FPDT_FFN-->>Transformer: output, output_bias
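For intuition about the FPDT_FFN path above, here is a minimal chunked-FFN sketch that assumes a `[seq_len, batch, hidden]` layout; the function and argument names are illustrative, not the PR's implementation:

```python
import torch
import torch.nn.functional as F

def fpdt_style_ffn(hidden_states, w_h_to_4h, b_h_to_4h, w_4h_to_h, b_4h_to_h,
                   chunk_size):
    """Process the sequence dimension in chunks so that only one chunk's
    4h-wide intermediate activation is materialized at a time."""
    outputs = []
    # hidden_states: [seq_len, batch, hidden]; split along the sequence dim
    for chunk in torch.split(hidden_states, chunk_size, dim=0):
        intermediate = F.linear(chunk, w_h_to_4h, b_h_to_4h)         # h -> 4h
        intermediate = F.gelu(intermediate)                          # activation
        outputs.append(F.linear(intermediate, w_4h_to_h, b_4h_to_h)) # 4h -> h
    return torch.cat(outputs, dim=0)
```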
Sequence diagram for Rotary Position Embedding

sequenceDiagram
participant ParallelTransformerLayer
participant RotaryEmbedding
participant ApplyRotaryPosEmb
ParallelTransformerLayer->>RotaryEmbedding: rotary_pos_emb(seq_length)
RotaryEmbedding-->>ParallelTransformerLayer: rotary_pos_emb_cos, rotary_pos_emb_sin
ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(query_layer, rotary_pos_emb_cos)
ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(key_layer, rotary_pos_emb_sin)
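For reference, the conventional way a cos/sin pair is applied to query and key tensors looks like the following sketch; this is the standard rotary formulation, not necessarily the exact code in this PR:

```python
import torch

def rotate_half(x):
    # Split the last dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(t, cos, sin):
    # t: [seq, ..., dim]; cos/sin broadcast over the non-sequence dimensions
    return (t * cos) + (rotate_half(t) * sin)
```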
Class diagram for FPDT and Rotary Position Embedding

classDiagram
class Transformer {
+bool ds_sequence_parallel_fpdt
+int fpdt_FFN_chunk_size
+forward(hidden_states)
}
class ParallelTransformerLayer {
+bool ds_sequence_parallel_fpdt
+bool ds_sequence_parallel_fpdt_offloading
+self_attention
+forward(hidden_states, attention_mask, inference_params, rotary_pos_emb)
}
class RotaryEmbedding {
+Tensor inv_freq
+float theta
+forward(max_seq_len, offset)
}
class FPDT_Attention {
+Parameter qkv_linear_weight
+Parameter qkv_linear_bias
+Parameter qkv_dense_weight
+Parameter qkv_dense_bias
+FPDT_Attention(config, qkv_linear_weight, qkv_linear_bias, qkv_dense_weight, qkv_dense_bias, sequence_process_group, gather_idx, return_bias, chunk_size, enable_offloading)
}
Transformer --> FPDT_Attention
ParallelTransformerLayer --> FPDT_Attention
ParallelTransformerLayer --> RotaryEmbedding
File-Level Changes
Hey @saforem2 - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding documentation explaining the FPDT optimization approach and when it should be used vs standard execution
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
```python
super().__init__()
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
self.register_buffer('inv_freq', inv_freq)
self.inv_freq = inv_freq.to(get_accelerator().current_device_name())
```
issue (performance): Moving from register_buffer to regular tensor attribute will cause unnecessary tensor recreation
Consider keeping this as a register_buffer to avoid recreating the tensor on every forward pass
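A hedged sketch of the reviewer's suggestion, with the surrounding class details assumed rather than taken from the PR:

```python
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, theta=10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        # Registered buffer: moves with the module on .to()/.cuda() and is
        # created once at init rather than recreated on every forward pass.
        self.register_buffer('inv_freq', inv_freq, persistent=False)

    def forward(self, max_seq_len, offset=0):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device,
                           dtype=self.inv_freq.dtype) + offset
        freqs = torch.einsum('i,j->ij', seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```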
```python
dtype = torch.float32

# Warmup fused bias+gelu
seq_length = args.seq_length
```
issue: Sequence length divisibility requirements should be validated explicitly
Add explicit validation that sequence length is divisible by required factors to fail fast with a clear error message
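A minimal sketch of the kind of fail-fast check being requested; the exact divisibility factors and argument names are assumptions, since they depend on the parallelism configuration:

```python
def validate_seq_length(seq_length, sp_world_size, fpdt_chunk_size=None):
    """Fail fast with a clear message instead of a shape error deep in attention."""
    assert seq_length % sp_world_size == 0, (
        f"seq_length ({seq_length}) must be divisible by the sequence-parallel "
        f"world size ({sp_world_size})"
    )
    if fpdt_chunk_size is not None:
        assert seq_length % fpdt_chunk_size == 0, (
            f"seq_length ({seq_length}) must be divisible by the FPDT chunk "
            f"size ({fpdt_chunk_size})"
        )
```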
```python
# Query, Key, and Value
# =====================

if self.attention_type == AttnType.self_attn:
```
issue (complexity): Consider refactoring the parallel attention implementation to use a cleaner class hierarchy with FPDT as the primary path
The parallel attention implementation has become overly complex with duplicate code paths. Consider refactoring to make FPDT the primary implementation:
```python
class ParallelAttention:
    def __init__(self, config, ...):
        self.use_legacy = not config.use_fpdt
        self.fpdt_attention = FPDTAttention(...) if not self.use_legacy else None
        self.legacy_attention = LegacyAttention(...) if self.use_legacy else None

    def forward(self, hidden_states, attention_mask, ...):
        if self.use_legacy:
            return self.legacy_attention(hidden_states, attention_mask, ...)
        return self.fpdt_attention(hidden_states, attention_mask, ...)
```

This separates the implementations while maintaining compatibility. The FPDT path removes the unnecessary dropout/bias complexity. Consider deprecating the legacy path in future versions.
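If that direction is taken, a call site might look like the following; the names follow the reviewer's sketch and are not an existing API:

```python
# Hypothetical call site under the suggested refactor: both paths expose the
# same forward signature, so the surrounding layer never needs to branch.
attention = ParallelAttention(config)
attn_output = attention(hidden_states, attention_mask)
```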
* [tools] GQA convert support
* fix readme
Previously, `deepspeed_to_megatron.py` would raise an import error due to the relative import. This commit fixes this issue by changing from the relative import to the absolute import like in `deepspeed_to_transformers.py`.
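For illustration, the kind of change described looks like this; the module and class names below are assumptions used only to make the relative-versus-absolute distinction concrete:

```python
# Before: a relative import fails when the script is executed directly,
# because there is no parent package in that case.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint

# After: an absolute import, matching the style of deepspeed_to_transformers.py.
from deepspeed_checkpoint import DeepSpeedCheckpoint
```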
Signed-off-by: Logan Adams <loadams@microsoft.com>
add sequence_parallel in layernorm init so that 3D parallelism can run successfully with DeepSpeed (#468)
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Schwidola0607 <khoadangpham82944@gmail.com>
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1
* add fused_rms_norm support on XPU device (#431)
* [LLaMa] Adding support converting checkpoint from mds to hf (#432)
  * add support converting checkpoint from hf to mds
  * Fix PP issue
  * update
* add device check when import ipex (#436)
* fix TFLOPs calculation (#371): when GQA is used, we observe correct TFLOPs after this fix; when GQA is not used, the huge difference in TFLOPs is resolved with selective recompute. Some other minor differences will also be observed since logits MACs are now included. Also adds copyrights.
* fix nan issue when running megatron-deepspeed (#434)
* enable empty cache on XPU device (#438)
* [wandb] disable wandb more gracefully (#422) (Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>)
* [Bug] Fix crash when logging optimizer state to tb (#417)
* add FPDT support; add Ulysses rotary position embedding support
* remove unnecessary files
* set the warmup length to be FPDT chunk size if enabled
* Enable Sequence Parallelism (#429)
* grad_wei can't be NoneType when running with DeepSpeed, since ZeRO-3 will divide the gradient (#428)
* fix init issue for rms_norm in sequence_parallel (#448)
* enable profiler for specific ranks (#451)
* fix init issue for silently ignoring the deepspeed config (#452)
* fix moe tflops (#445)
* [tools] GQA convert support (#454); fix readme
* Fix import error in `deepspeed_to_megatron.py` (#455): previously, `deepspeed_to_megatron.py` would raise an import error due to the relative import; this changes it to an absolute import like in `deepspeed_to_transformers.py`.
* Update references to new GitHub org (deepspeedai) (#462)
* add sequence_parallel in layernorm init so that 3D parallelism can run successfully with DeepSpeed (#468)
* fix bug when FPDT is disabled but with original Ulysses

--------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao <yjhmitweb@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Summary by Sourcery
Add support for FPDT (Fully Pipelined Distributed Transformer) to improve long-sequence training efficiency and support batch sizes larger than 1, and introduce Ulysses rotary position embedding for better positional encoding. Include a new script for pretraining GPT-3 models with FPDT, and set the warmup length to the FPDT chunk size when FPDT is enabled.