Optimization for llm/gpt-3 #6570
Conversation
Replace parameters with config in MHA; replace the GPTEmbedding ParamAttr initializer with _init_weights; modify the fuse_attention_qkv parameter.
Thanks for your contribution!
Codecov Report
@@           Coverage Diff            @@
##           develop    #6570   +/-   ##
========================================
  Coverage    62.94%   62.94%
========================================
  Files          531      531
  Lines        77727    77727
========================================
  Hits         48923    48923
  Misses       28804    28804
llm/gpt-3/modeling.py
Outdated
need_weights=False, #
weight_attr=None, #
bias_attr=None, #
do_recompute=False,
Take a look at how these parameters are used; they should all be removable.
kdim=None, #
vdim=None, #
need_weights=False, #
weight_attr=None, #
bias_attr=None, #
do_recompute=False,
Removed kdim, vdim, need_weights, and bias_attr; kept weight_attr and do_recompute as part of the TransformerDecoderLayer parameter interface.
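For illustration, a minimal sketch of the trimmed constructor after this change; the class name and remaining arguments are assumed from the surrounding diff, not taken from the merged code:

```python
import paddle.nn as nn


class MultiHeadAttention(nn.Layer):
    # Hypothetical sketch: kdim / vdim / need_weights / bias_attr are dropped;
    # weight_attr and do_recompute remain as the TransformerDecoderLayer-facing knobs.
    def __init__(self, config, weight_attr=None, do_recompute=False):
        super().__init__()
        self.config = config
        self.weight_attr = weight_attr
        self.do_recompute = do_recompute
```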
llm/gpt-3/modeling.py
Outdated
embed_dim = config.hidden_size
self.embed_dim = config.hidden_size
self.kdim = kdim if kdim is not None else config.hidden_size
self.vdim = vdim if vdim is not None else config.hidden_size
kdim and vdim should not be passed in separately here; just use config.hidden_size directly.
Both places now use hidden_size directly; kdim and vdim have been removed.
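As a small illustration of the result, every projection is now sized from the same hidden size (the literal value below is only a placeholder for config.hidden_size):

```python
import paddle.nn as nn

hidden_size = 1024  # placeholder for config.hidden_size; kdim/vdim no longer exist
q_proj = nn.Linear(hidden_size, hidden_size)
k_proj = nn.Linear(hidden_size, hidden_size)
v_proj = nn.Linear(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, hidden_size)
```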
llm/gpt-3/modeling.py
Outdated
need_weights=False, #
weight_attr=None, #
bias_attr=None, #
do_recompute=False,
if num_partitions > 1:
if config.tensor_parallel_degree > 1:
assert self.num_heads % config.tensor_parallel_degree == 0
self.num_heads = self.num_heads // config.tensor_parallel_degree
llm/gpt-3/modeling.py
Outdated
if isinstance(layer, (nn.Linear,
                      nn.Embedding,
                      fleet.meta_parallel.VocabParallelEmbedding)):
    # In the dygraph mode, use the `set_value` to reset the parameter directly,
This is incomplete; refer to the llama implementation.
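For reference, a sketch of a fuller initialization hook modeled on the llama modeling code in this repo; the exact class list and the initializer_range field are assumptions here, not the merged code:

```python
import paddle
import paddle.nn as nn
from paddle.distributed import fleet


def _init_weights(self, layer):
    # Hypothetical llama-style sketch: also cover the tensor-parallel linear/embedding layers.
    if isinstance(
        layer,
        (
            nn.Linear,
            nn.Embedding,
            fleet.meta_parallel.VocabParallelEmbedding,
            fleet.meta_parallel.ColumnParallelLinear,
            fleet.meta_parallel.RowParallelLinear,
        ),
    ):
        # In dygraph mode, `set_value` resets the parameter tensor in place.
        if isinstance(layer.weight, paddle.Tensor):
            layer.weight.set_value(
                paddle.normal(
                    mean=0.0,
                    std=self.config.initializer_range,  # assumed config field
                    shape=layer.weight.shape,
                )
            )
```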
@@ -682,6 +669,17 @@ def get_tensor_parallel_split_mappings(num_layers):
    "layers.0.linear2.weight": partial(fn, is_column=False),
}

if config.fuse_attention_qkv:
    base_actions["layers.0.self_attn.qkv_proj.weight"] = partial(fn, is_column=True)
Please verify forward-pass precision: single card vs. tp=2.
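One possible way to check this, assuming both runs dump the logits of the same batch to .npy files (the file names below are placeholders, not scripts in this repo):

```python
# Hypothetical precision check: compare logits from a single-card forward pass
# against logits gathered from a tensor_parallel_degree=2 run of the same inputs.
import numpy as np

single = np.load("logits_single.npy")  # placeholder dump from the 1-GPU run
tp2 = np.load("logits_tp2.npy")        # placeholder dump from the tp=2 run
print("max abs diff:", np.abs(single - tp2).max())
assert np.allclose(single, tp2, atol=1e-5), "forward precision mismatch between single card and tp=2"
```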
llm/gpt-3/modeling.py
Outdated
self.head_dim = embed_dim // num_heads
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
self.use_flash_attn = config.use_flash_attn if flash_attention else None
Rename this to `use_flash_attention`.
llm/gpt-3/README.md
Outdated
Use the script below to continue training on top of llama-7b.
Note:
1. Training requires the paddle develop build; install the missing wheel packages, e.g. `pip install tool_helpers visualdl==2.5.3`.
2. `use_flash_attn` needs to be enabled on A100 machines; otherwise the loss may become abnormal (quickly dropping to 0.00x, abnormally small). A cuda11.8 environment is recommended.
2. `use_flash_attn` needs to be enabled on A100 machines; otherwise the loss may become abnormal (quickly dropping to 0.00x, abnormally small). A cuda11.8 environment is recommended.
2. `use_flash_attention` needs to be enabled on A100 machines; otherwise the loss may become abnormal (quickly dropping to 0.00x, abnormally small). A cuda11.8 environment is recommended.
llm/gpt-3/README.md
Outdated
export PYTHONPATH="../../PaddleNLP/"
export FLAGS_cudnn_deterministic=True
log_dir="log"
rm -rf $log_dir

python -u -m paddle.distributed.launch \
    --gpus "0" \
    --gpus "6,7" \
--gpus "6,7" \ | |
--gpus "0" \ |
llm/gpt-3/modeling.py
Outdated
if config.tensor_parallel_degree > 1:
    assert config.num_attention_heads % config.tensor_parallel_degree == 0
    config.num_attention_heads = config.num_attention_heads // config.tensor_parallel_degree
config.num_attention_heads = config.num_attention_heads // config.tensor_parallel_degree
self.num_attention_heads = config.num_attention_heads // config.tensor_parallel_degree
Where the original variable gets modified, assign the result to a new variable instead; do not modify config directly.
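A minimal sketch of the suggested pattern, written here as a hypothetical helper so the shared config object is never mutated:

```python
def split_heads_for_tensor_parallel(config):
    # Hypothetical helper: compute the per-rank head count locally
    # instead of overwriting config.num_attention_heads.
    num_attention_heads = config.num_attention_heads
    if config.tensor_parallel_degree > 1:
        assert num_attention_heads % config.tensor_parallel_degree == 0
        num_attention_heads = num_attention_heads // config.tensor_parallel_degree
    return num_attention_heads  # assign this to self.num_attention_heads in the layer
```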
llm/gpt-3/modeling.py
Outdated
@@ -270,10 +233,10 @@ def gen_cache(self, key, value=None, type=Cache):
    return self.StaticCache(k, v)
elif value is None:  # incremental_state
    k = layers.fill_constant_batch_size_like(
        input=key, shape=[-1, self.num_heads, 0, self.head_dim], dtype=key.dtype, value=0
        input=key, shape=[-1, self.config.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
input=key, shape=[-1, self.config.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
input=key, shape=[-1, self.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
llm/gpt-3/modeling.py
Outdated
)
v = layers.fill_constant_batch_size_like(
    input=key, shape=[-1, self.num_heads, 0, self.head_dim], dtype=key.dtype, value=0
    input=key, shape=[-1, self.config.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
input=key, shape=[-1, self.config.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
input=key, shape=[-1, self.num_attention_heads, 0, self.head_dim], dtype=key.dtype, value=0
llm/gpt-3/modeling.py
Outdated
# Recompute defaults to False and is controlled by Trainer
self.enable_recompute = False

config.use_flash_attention = config.use_flash_attention if flash_attention else None
config.use_flash_attention = config.use_flash_attention if flash_attention else None
self.use_flash_attention = config.use_flash_attention if flash_attention else None
llm/gpt-3/modeling.py
Outdated
out = paddle.matmul(weights, v)

# combine heads
out = tensor.transpose(out, perm=[0, 2, 1, 3])
out = tensor.reshape(x=out, shape=[0, 0, -1])

return (out, weights) if self.need_weights else out
return (out, weights) if self.config.need_weights else out
Referring to this spot: replace `self.config.need_weights`
with an `output_attentions` parameter on the `forward` function.
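A sketch of how the tail of the attention forward pass could honor an output_attentions flag instead of reading need_weights from config; the free function below is only illustrative, not the merged interface:

```python
import paddle


def attention_output(weights, v, output_attentions=False):
    # Hypothetical sketch: the caller passes output_attentions into forward();
    # attention weights are only returned when it is set.
    out = paddle.matmul(weights, v)                  # [bs, heads, q_len, head_dim]
    out = paddle.transpose(out, perm=[0, 2, 1, 3])   # combine heads
    out = paddle.reshape(out, shape=[0, 0, -1])      # [bs, q_len, heads * head_dim]
    return (out, weights) if output_attentions else out
```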
LGTM
PR types
Function optimization
PR changes
APIs and Docs
Description
Optimization for llm/gpt-3
Update the README.md file.
Replace self. parameters with config. parameters in modeling.py.
Replace the ParamAttr initializer with _init_weights for GPTPretrainedModel.
Use output_attentions (need_weights) to control attention weights output.