Adding the new feature of FPDT (#441) #11
Conversation
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1
* add FPDT support; add Ulysses rotary position embedding support
* remove unnecessary files
* set the warmup length to be FPDT chunk size if enabled

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Reviewer's Guide by Sourcery

This PR implements FPDT (Fully Pipelined Distributed Transformer) support and Ulysses rotary position embedding in the transformer model. The main changes include adding FPDT functionality for sequence-parallel processing, modifying the attention mechanism to support FPDT, and updating the rotary position embedding implementation.

Sequence diagram for FPDT in the Transformer forward pass

sequenceDiagram
participant Transformer
participant FPDT_FFN
participant DenseHTo4H
participant ActivationFunc
participant Dense4HToH
Transformer->>FPDT_FFN: apply(hidden_states, dense_h_to_4h.weight, dense_h_to_4h.bias, dense_4h_to_h.weight, dense_4h_to_h.bias, add_bias, fpdt_FFN_chunk_size)
alt FPDT not enabled
Transformer->>DenseHTo4H: dense_h_to_4h(hidden_states)
DenseHTo4H-->>Transformer: intermediate_parallel, bias_parallel
Transformer->>ActivationFunc: activation_func(intermediate_parallel)
ActivationFunc-->>Transformer: intermediate_parallel
Transformer->>Dense4HToH: dense_4h_to_h(intermediate_parallel)
Dense4HToH-->>Transformer: output, output_bias
end
    FPDT_FFN-->>Transformer: output, output_bias
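For intuition about the FPDT_FFN path above, here is a minimal chunked-FFN sketch that assumes a `[seq_len, batch, hidden]` layout; the function and argument names are illustrative, not the PR's implementation:

```python
import torch
import torch.nn.functional as F

def fpdt_style_ffn(hidden_states, w_h_to_4h, b_h_to_4h, w_4h_to_h, b_4h_to_h,
                   chunk_size):
    """Process the sequence dimension in chunks so that only one chunk's
    4h-wide intermediate activation is materialized at a time."""
    outputs = []
    # hidden_states: [seq_len, batch, hidden]; split along the sequence dim
    for chunk in torch.split(hidden_states, chunk_size, dim=0):
        intermediate = F.linear(chunk, w_h_to_4h, b_h_to_4h)         # h -> 4h
        intermediate = F.gelu(intermediate)                          # activation
        outputs.append(F.linear(intermediate, w_4h_to_h, b_4h_to_h)) # 4h -> h
    return torch.cat(outputs, dim=0)
```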
Sequence diagram for Rotary Position Embedding

sequenceDiagram
participant ParallelTransformerLayer
participant RotaryEmbedding
participant ApplyRotaryPosEmb
ParallelTransformerLayer->>RotaryEmbedding: rotary_pos_emb(seq_length)
RotaryEmbedding-->>ParallelTransformerLayer: rotary_pos_emb_cos, rotary_pos_emb_sin
ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(query_layer, rotary_pos_emb_cos)
ParallelTransformerLayer->>ApplyRotaryPosEmb: apply_rotary_pos_emb(key_layer, rotary_pos_emb_sin)
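For reference, the conventional way a cos/sin pair is applied to query and key tensors looks like the following sketch; this is the standard rotary formulation, not necessarily the exact code in this PR:

```python
import torch

def rotate_half(x):
    # Split the last dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(t, cos, sin):
    # t: [seq, ..., dim]; cos/sin broadcast over the non-sequence dimensions
    return (t * cos) + (rotate_half(t) * sin)
```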
Class diagram for FPDT and Rotary Position Embedding

classDiagram
class Transformer {
+bool ds_sequence_parallel_fpdt
+int fpdt_FFN_chunk_size
+forward(hidden_states)
}
class ParallelTransformerLayer {
+bool ds_sequence_parallel_fpdt
+bool ds_sequence_parallel_fpdt_offloading
+self_attention
+forward(hidden_states, attention_mask, inference_params, rotary_pos_emb)
}
class RotaryEmbedding {
+Tensor inv_freq
+float theta
+forward(max_seq_len, offset)
}
class FPDT_Attention {
+Parameter qkv_linear_weight
+Parameter qkv_linear_bias
+Parameter qkv_dense_weight
+Parameter qkv_dense_bias
+FPDT_Attention(config, qkv_linear_weight, qkv_linear_bias, qkv_dense_weight, qkv_dense_bias, sequence_process_group, gather_idx, return_bias, chunk_size, enable_offloading)
}
Transformer --> FPDT_Attention
ParallelTransformerLayer --> FPDT_Attention
ParallelTransformerLayer --> RotaryEmbedding
File-Level Changes
Hey @saforem2 - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding documentation explaining the FPDT optimization approach and when it should be used vs standard execution
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
```python
super().__init__()
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
self.register_buffer('inv_freq', inv_freq)
self.inv_freq = inv_freq.to(get_accelerator().current_device_name())
```
issue (performance): Moving from register_buffer to regular tensor attribute will cause unnecessary tensor recreation
Consider keeping this as a register_buffer to avoid recreating the tensor on every forward pass
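A hedged sketch of the reviewer's suggestion, with the surrounding class details assumed rather than taken from the PR:

```python
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, theta=10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        # Registered buffer: moves with the module on .to()/.cuda() and is
        # created once at init rather than recreated on every forward pass.
        self.register_buffer('inv_freq', inv_freq, persistent=False)

    def forward(self, max_seq_len, offset=0):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device,
                           dtype=self.inv_freq.dtype) + offset
        freqs = torch.einsum('i,j->ij', seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```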
```python
dtype = torch.float32

# Warmup fused bias+gelu
seq_length = args.seq_length
```
issue: Sequence length divisibility requirements should be validated explicitly
Add explicit validation that sequence length is divisible by required factors to fail fast with a clear error message
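A minimal sketch of the kind of fail-fast check being requested; the exact divisibility factors and argument names are assumptions, since they depend on the parallelism configuration:

```python
def validate_seq_length(seq_length, sp_world_size, fpdt_chunk_size=None):
    """Fail fast with a clear message instead of a shape error deep in attention."""
    assert seq_length % sp_world_size == 0, (
        f"seq_length ({seq_length}) must be divisible by the sequence-parallel "
        f"world size ({sp_world_size})"
    )
    if fpdt_chunk_size is not None:
        assert seq_length % fpdt_chunk_size == 0, (
            f"seq_length ({seq_length}) must be divisible by the FPDT chunk "
            f"size ({fpdt_chunk_size})"
        )
```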
```python
# Query, Key, and Value
# =====================

if self.attention_type == AttnType.self_attn:
```
issue (complexity): Consider refactoring the parallel attention implementation to use a cleaner class hierarchy with FPDT as the primary path
The parallel attention implementation has become overly complex with duplicate code paths. Consider refactoring to make FPDT the primary implementation:
```python
class ParallelAttention:
    def __init__(self, config, ...):
        self.use_legacy = not config.use_fpdt
        self.fpdt_attention = FPDTAttention(...) if not self.use_legacy else None
        self.legacy_attention = LegacyAttention(...) if self.use_legacy else None

    def forward(self, hidden_states, attention_mask, ...):
        if self.use_legacy:
            return self.legacy_attention(hidden_states, attention_mask, ...)
        return self.fpdt_attention(hidden_states, attention_mask, ...)
```

This separates the implementations while maintaining compatibility. The FPDT path removes the unnecessary dropout/bias complexity. Consider deprecating the legacy path in future versions.
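If that direction is taken, a call site might look like the following; the names follow the reviewer's sketch and are not an existing API:

```python
# Hypothetical call site under the suggested refactor: both paths expose the
# same forward signature, so the surrounding layer never needs to branch.
attention = ParallelAttention(config)
attn_output = attention(hidden_states, attention_mask)
```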
* [tools] GQA convert support
* fix readme
Previously, `deepspeed_to_megatron.py` would raise an import error due to the relative import. This commit fixes this issue by changing from the relative import to the absolute import like in `deepspeed_to_transformers.py`.
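For illustration, the kind of change described looks like this; the module and class names below are assumptions used only to make the relative-versus-absolute distinction concrete:

```python
# Before: a relative import fails when the script is executed directly,
# because there is no parent package in that case.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint

# After: an absolute import, matching the style of deepspeed_to_transformers.py.
from deepspeed_checkpoint import DeepSpeedCheckpoint
```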
Signed-off-by: Logan Adams <loadams@microsoft.com>
add sequence_parallel in layernorm init so that 3D parallelism can run successfully with DeepSpeed (#468)
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Schwidola0607 <khoadangpham82944@gmail.com>
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1
* add fused_rms_norm support on XPU device (#431)
* [LLaMa] Adding support converting checkpoint from mds to hf (#432)
  * add support converting checkpoint from hf to mds
  * Fix PP issue
  * update
* add device check when import ipex (#436)
* fix TFLOPs calculation (#371): when GQA is used, we observe correct TFLOPs after this fix; when GQA is not used, the huge difference in TFLOPs is resolved with selective recompute. Some other minor differences will also be observed since logits MACs are now included. Also adds copyrights.
* fix nan issue when running megatron-deepspeed (#434)
* enable empty cache on XPU device (#438)
* [wandb] disable wandb more gracefully (#422) (Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>)
* [Bug] Fix crash when logging optimizer state to tb (#417)
* add FPDT support; add Ulysses rotary position embedding support
* remove unnecessary files
* set the warmup length to be FPDT chunk size if enabled
* Enable Sequence Parallelism (#429)
* grad_wei can't be NoneType when running with DeepSpeed, since ZeRO-3 will divide the gradient (#428)
* fix init issue for rms_norm in sequence_parallel (#448)
* enable profiler for specific ranks (#451)
* fix init issue for silently ignoring the deepspeed config (#452)
* fix moe tflops (#445)
* [tools] GQA convert support (#454); fix readme
* Fix import error in `deepspeed_to_megatron.py` (#455): previously, `deepspeed_to_megatron.py` would raise an import error due to the relative import; this changes it to an absolute import like in `deepspeed_to_transformers.py`.
* Update references to new GitHub org (deepspeedai) (#462)
* add sequence_parallel in layernorm init so that 3D parallelism can run successfully with DeepSpeed (#468)
* fix bug when FPDT is disabled but with original Ulysses

--------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao <yjhmitweb@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Summary by Sourcery
Add support for FPDT (Fully Pipelined Distributed Transformer) to improve long-sequence training efficiency and support batch sizes larger than 1, and introduce Ulysses rotary position embedding for better positional encoding. Include a new script for pretraining GPT-3 models with FPDT, and set the warmup length to the FPDT chunk size when FPDT is enabled.