Conversation


saforem2 (Owner) commented Dec 6, 2024

  • pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

  • add FPDT support; add Ulysses rotary position embedding support

  • add FPDT support; add Ulysses rotary position embedding support

  • add FPDT support; add Ulysses rotary position embedding support

  • add FPDT support; add Ulysses rotary position embedding support

  • remove unnecessary files

  • set the warmup length to be FPDT chunk size if enabled


Summary by Sourcery

Add FPDT support for memory-efficient training on long sequences and introduce Ulysses rotary position embedding support. Include a new script for pretraining GPT models with FPDT, and set the warmup length to match the FPDT chunk size when FPDT is enabled.

New Features:

  • Introduce FPDT (Fully Pipelined Distributed Transformer) support for memory-efficient training on long sequences, processing attention and FFN activations in chunks.
  • Add Ulysses rotary position embedding support, i.e. rotary embeddings that work with Ulysses sequence parallelism.

Enhancements:

  • Set the warmup length to the FPDT chunk size when FPDT is enabled, so kernel warmup uses the same shapes as chunked execution (see the sketch after this summary).

Build:

  • Add a new script for pretraining GPT-3 models with FPDT support, including configuration for various model sizes and training parameters.
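
The warmup enhancement above amounts to choosing the warmup sequence length from the FPDT chunk size when FPDT is active. A minimal sketch of that selection, using hypothetical argument names (ds_sequence_parallel_fpdt, ds_sequence_parallel_fpdt_chunk_size) modeled on the attributes in the reviewer's class diagram rather than taken from the PR:

def warmup_length(args):
    # When FPDT is enabled, fused kernels run on per-chunk tensors, so warm
    # them up with the chunk length instead of the full sequence length.
    if getattr(args, 'ds_sequence_parallel_fpdt', False):
        return args.ds_sequence_parallel_fpdt_chunk_size
    return args.seq_length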

---------

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>

sourcery-ai bot commented Dec 6, 2024

Reviewer's Guide by Sourcery

This PR implements FPDT (Fully Pipelined Distributed Transformer) support and Ulysses rotary position embedding in the transformer model. The main changes add FPDT functionality for chunked sequence parallel processing, modify the attention mechanism to support FPDT, and update the rotary position embedding implementation.

Sequence diagram for FPDT in Transformer forward pass

sequenceDiagram
    participant Transformer
    participant FPDT_FFN
    participant DenseHTo4H
    participant ActivationFunc
    participant Dense4HToH
    Transformer->>FPDT_FFN: apply(hidden_states, dense_h_to_4h.weight, dense_h_to_4h.bias, dense_4h_to_h.weight, dense_4h_to_h.bias, add_bias, fpdt_FFN_chunk_size)
    alt FPDT not enabled
        Transformer->>DenseHTo4H: dense_h_to_4h(hidden_states)
        DenseHTo4H-->>Transformer: intermediate_parallel, bias_parallel
        Transformer->>ActivationFunc: activation_func(intermediate_parallel)
        ActivationFunc-->>Transformer: intermediate_parallel
        Transformer->>Dense4HToH: dense_4h_to_h(intermediate_parallel)
        Dense4HToH-->>Transformer: output, output_bias
    end
    Transformer-->>FPDT_FFN: output, output_bias
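The alt branch above covers only the non-FPDT path. For intuition, here is a minimal, hypothetical sketch of the chunking idea behind FPDT_FFN: the sequence dimension is split into fixed-size chunks so the 4h intermediate activation is never materialized for the full sequence at once. The actual FPDT_FFN is a custom autograd function (hence the apply(...) call above) and also handles bias fusion, recomputation, and optional host offloading, none of which is shown here.

import torch
import torch.nn.functional as F

def chunked_ffn(hidden_states, w_h_to_4h, b_h_to_4h, w_4h_to_h, b_4h_to_h, chunk_size):
    # hidden_states is assumed to use Megatron's [seq, batch, hidden] layout;
    # weights follow F.linear's (out_features, in_features) convention.
    outputs = []
    for chunk in torch.split(hidden_states, chunk_size, dim=0):
        intermediate = F.linear(chunk, w_h_to_4h, b_h_to_4h)   # h -> 4h
        intermediate = F.gelu(intermediate)                    # activation_func
        outputs.append(F.linear(intermediate, w_4h_to_h, b_4h_to_h))  # 4h -> h
    return torch.cat(outputs, dim=0)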

Sequence diagram for Rotary Position Embedding

sequenceDiagram
    participant ParallelTransformerLayer
    participant RotaryEmbedding
    participant ApplyRotaryPosEmb
    ParallelTransformerLayer->>RotaryEmbedding: rotary_pos_emb(seq_length)
    RotaryEmbedding-->>ParallelTransformerLayer: rotary_pos_emb_cos, rotary_pos_emb_sin
    ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(query_layer, rotary_pos_emb_cos)
    ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(key_layer, rotary_pos_emb_sin)
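For reference, a minimal sketch of the standard rotary update with separate cos and sin tables; in the usual formulation both tables are applied to each of the query and key layers, and the PR's apply_rotary_pos_emb may differ in layout or fusion details.

import torch

def rotate_half(x):
    # (x1, x2) -> (-x2, x1) on the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(t, cos, sin):
    # cos/sin broadcast over the batch and head dimensions.
    return (t * cos) + (rotate_half(t) * sin)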

Class diagram for FPDT and Rotary Position Embedding

classDiagram
    class Transformer {
        +bool ds_sequence_parallel_fpdt
        +int fpdt_FFN_chunk_size
        +forward(hidden_states)
    }
    class ParallelTransformerLayer {
        +bool ds_sequence_parallel_fpdt
        +bool ds_sequence_parallel_fpdt_offloading
        +self_attention
        +forward(hidden_states, attention_mask, inference_params, rotary_pos_emb)
    }
    class RotaryEmbedding {
        +Tensor inv_freq
        +float theta
        +forward(max_seq_len, offset)
    }
    class FPDT_Attention {
        +Parameter qkv_linear_weight
        +Parameter qkv_linear_bias
        +Parameter qkv_dense_weight
        +Parameter qkv_dense_bias
        +FPDT_Attention(config, qkv_linear_weight, qkv_linear_bias, qkv_dense_weight, qkv_dense_bias, sequence_process_group, gather_idx, return_bias, chunk_size, enable_offloading)
    }
    Transformer --> FPDT_Attention
    ParallelTransformerLayer --> FPDT_Attention
    ParallelTransformerLayer --> RotaryEmbedding
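A minimal sketch matching the RotaryEmbedding interface shown in the class diagram, i.e. forward(max_seq_len, offset) returning separate cos and sin tensors; shapes and dtype handling here are illustrative rather than taken from the PR.

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, theta=10000.0):
        super().__init__()
        # Inverse frequency for each pair of embedding dimensions.
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, max_seq_len, offset=0):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device,
                           dtype=self.inv_freq.dtype) + offset
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Shaped [seq, 1, 1, dim] so the tables broadcast over batch and heads.
        return emb.cos()[:, None, None, :], emb.sin()[:, None, None, :]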

File-Level Changes

Added FPDT (Fully Pipelined Distributed Transformer) support for sequence parallel processing
  • Added FPDT configuration options and parameters
  • Implemented FPDT chunk size handling for sequence parallelism
  • Added FPDT FFN and Attention implementations
  • Modified warmup function to use FPDT chunk size when enabled
  Files: megatron/model/transformer.py, megatron/arguments.py, megatron/initialize.py

Modified attention mechanism to support FPDT and batch processing
  • Updated query/key/value tensor handling for FPDT
  • Added support for batch dimension index (batch_dim_idx) in distributed attention
  • Modified attention mask and bias handling for FPDT compatibility
  Files: megatron/model/transformer.py

Updated rotary position embedding implementation
  • Modified buffer handling for rotary embeddings
  • Added support for separate cos/sin tensors
  • Updated position embedding application in the attention mechanism
  Files: megatron/model/rotary_pos_embedding.py, megatron/model/language_model.py

Added example configuration for FPDT training
  • Created new training script for GPT model with FPDT
  • Added configuration parameters for FPDT chunk size and offloading
  • Included sequence parallel size settings
  Files: examples_deepspeed/sequence_parallel/ds_pretrain_gpt_6.7B_fpdt_32k.sh
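
The first change above adds FPDT configuration options in megatron/arguments.py. As a rough sketch only, with flag names and defaults guessed from the attribute names in the class diagram rather than confirmed against the PR:

import argparse

def add_fpdt_args(parser):
    group = parser.add_argument_group(title='fpdt')
    group.add_argument('--ds-sequence-parallel-fpdt', action='store_true',
                       help='Enable FPDT chunked attention/FFN for Ulysses sequence parallelism.')
    group.add_argument('--ds-sequence-parallel-fpdt-chunk-size', type=int, default=65536,
                       help='Number of tokens per FPDT chunk.')
    group.add_argument('--ds-sequence-parallel-fpdt-offloading', action='store_true',
                       help='Offload FPDT chunks to host memory when not in use.')
    return parser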

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.



sourcery-ai bot left a comment


Hey @saforem2 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding documentation explaining the FPDT optimization approach and when it should be used vs standard execution
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

super().__init__()
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
self.register_buffer('inv_freq', inv_freq)
self.inv_freq = inv_freq.to(get_accelerator().current_device_name())

issue (performance): Moving from register_buffer to regular tensor attribute will cause unnecessary tensor recreation

Consider keeping this as a register_buffer to avoid recreating the tensor on every forward pass
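
One way to follow this suggestion, shown here only as a sketch rather than the PR's code, is to keep the buffer registration and move the tensor to the accelerator device once at construction time:

import torch
from deepspeed.accelerator import get_accelerator

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, theta=10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        # A registered (non-persistent) buffer stays tied to the module, moves
        # with .to()/.cuda(), and is created only once.
        self.register_buffer(
            'inv_freq',
            inv_freq.to(get_accelerator().current_device_name()),
            persistent=False,
        )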

dtype = torch.float32

# Warmup fused bias+gelu
seq_length = args.seq_length

issue: Sequence length divisibility requirements should be validated explicitly

Add explicit validation that sequence length is divisible by required factors to fail fast with a clear error message
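
A hypothetical example of such a check; the exact divisibility factors depend on the sequence-parallel world size and FPDT chunk size used in this code path:

def validate_seq_length(seq_length, sp_world_size, fpdt_chunk_size=None):
    # Fail fast with a clear message instead of a shape error deep in attention.
    if seq_length % sp_world_size != 0:
        raise ValueError(
            f'seq_length ({seq_length}) must be divisible by the sequence '
            f'parallel world size ({sp_world_size}).')
    if fpdt_chunk_size is not None and seq_length % fpdt_chunk_size != 0:
        raise ValueError(
            f'seq_length ({seq_length}) must be divisible by the FPDT chunk '
            f'size ({fpdt_chunk_size}).')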

# Query, Key, and Value
# =====================

if self.attention_type == AttnType.self_attn:

issue (complexity): Consider refactoring the parallel attention implementation to use a cleaner class hierarchy with FPDT as the primary path

The parallel attention implementation has become overly complex with duplicate code paths. Consider refactoring to make FPDT the primary implementation:

class ParallelAttention:
    def __init__(self, config, ...):
        self.use_legacy = not config.use_fpdt
        self.fpdt_attention = FPDTAttention(...) if not self.use_legacy else None
        self.legacy_attention = LegacyAttention(...) if self.use_legacy else None

    def forward(self, hidden_states, attention_mask, ...):
        if self.use_legacy:
            return self.legacy_attention(hidden_states, attention_mask, ...)
        return self.fpdt_attention(hidden_states, attention_mask, ...)

This separates the implementations while maintaining compatibility. The FPDT path removes unnecessary dropout/bias complexity. Consider deprecating the legacy path in future versions.

inkcherry and others added 7 commits December 18, 2024 10:04
* [tools]GQA convert support

* fix readme
Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes this issue by changing from the relative import
to the absolute import like in `deepspeed_to_transformers.py`.
Signed-off-by: Logan Adams <loadams@microsoft.com>
…run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Schwidola0607 <khoadangpham82944@gmail.com>
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add fused_rms_norm support on XPU device (#431)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [LLaMa] Adding support converting checkpoint from mds to hf (#432)

* add support converting checkpoint from hf to mds

* Fix PP issue

* update

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add device check when import ipex (#436)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix TFLOPs calculation (#371)

* fix TFLOPs calculation

when GQA used, we observe right TFLOPs after this fix.
when GQA is not used, huge difference in TFLOPs is solved with
selective recompute .
some other minor difference will also be observed as logits macs also added.

* add copyrights

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix nan issue when running megatron-deepspeed (#434)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable empty cache on XPU device (#438)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [wandb] disable wandb more gracefully (#422)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [Bug] Fix crash when logging optimizer state to tb (#417)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* remove unnecessary files

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* set the warmup length to be FPDT chunk size if enabled

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Enable Sequence Parallelism (#429)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* grad_wei can't be NoneType when running with DeepSpeed, for zero3 will divided the gradient (#428)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix init issue for rms_norm in squence_parallel (#448)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable profiler for specific ranks (#451)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix init issue for silently ignoring the deepspeed config (#452)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix moe tflops (#445)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [tool]GQA convert support (#454)

* [tools]GQA convert support

* fix readme

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Fix import error in `deepspeed_to_megatron.py` (#455)

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes this issue by changing from the relative import
to the absolute import like in `deepspeed_to_transformers.py`.

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Update references to new GitHub org (deepspeedai) (#462)

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add sequence_parallel in layernorm init to enable 3D parallelism can run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix bug when FPDT is disabled but with original Ulysses

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

---------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>