[LLM-paddle] add llama1-7b pretrain with callback #239

Merged: 15 commits, Sep 28, 2023
Changes from 1 commit
update config
DrownFish19 committed Sep 27, 2023
commit af2fd2f83fd560b8a326dc6b8c445eb4e814efad
2 changes: 1 addition & 1 deletion training/benchmarks/llama1_13B/paddle
2 changes: 1 addition & 1 deletion training/benchmarks/llama1_7B/README.md
Collaborator:
Please write up the llama1_7B pretraining task following the format of the other cases, including the model description, dataset download, dataset processing scripts, the open-source license of the code source, and so on.

@@ -24,7 +24,7 @@ wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwe
* Loaded automatically at runtime

#### Model checkpoint
* Downloaded automatically at runtime; parameter size: 7B
* Downloaded automatically at runtime
* Use of the Paddle LLaMA model weights must comply with the [License](../../paddlenlp/transformers/llama/LICENSE).

### Framework and chip support
Empty file.
@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-13b"
tokenizer_name_or_path: str = "facebook/llama-13b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 1
pipeline_parallel_degree = 1
sharding_parallel_degree = 8
gradient_accumulation_steps = 32
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage2"
recompute = False
recompute_granularity = "full"
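As a quick sanity check on the parallel section above, the effective global batch size follows from the per-device batch size, the gradient-accumulation steps, and the sharding degree. A minimal sketch, assuming a single 8-GPU node where the sharding groups double as the data-parallel groups:

```python
# Sketch only: effective global batch size implied by the config above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 32
sharding_parallel_degree = 8  # acts as the data-parallel width here

global_batch_size = (
    per_device_train_batch_size
    * gradient_accumulation_steps
    * sharding_parallel_degree
)
print(global_batch_size)  # 4 * 32 * 8 = 1024 sequences per optimizer step
```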
@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-13b"
tokenizer_name_or_path: str = "facebook/llama-13b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 2
pipeline_parallel_degree = 1
sharding_parallel_degree = 4
gradient_accumulation_steps = 64
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage1"
recompute = False
recompute_granularity = "full"
@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-13b"
tokenizer_name_or_path: str = "facebook/llama-13b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 2
pipeline_parallel_degree = 1
sharding_parallel_degree = 4
gradient_accumulation_steps = 64
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage2"
recompute = False
recompute_granularity = "full"
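The two TP=2 variants above differ only in the sharding stage; both keep the global batch at 4 × 64 × 4 = 1024, matching the sharding-only config. As an informal guide (following the usual ZeRO-style convention that sharded data parallelism mirrors; treat the exact Paddle memory behavior as an assumption), higher stages shard more optimizer and gradient state per rank:

```python
# Informal summary of what each sharding stage partitions across the
# sharding group (ZeRO-style convention; exact framework behavior may vary).
SHARDING_STAGES = {
    "stage1": "optimizer states sharded; gradients and parameters replicated",
    "stage2": "optimizer states and gradients sharded; parameters replicated",
    "stage3": "optimizer states, gradients, and parameters all sharded",
}

for stage in ("stage1", "stage2"):
    print(f"{stage}: {SHARDING_STAGES[stage]}")
```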
@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-13b"
tokenizer_name_or_path: str = "facebook/llama-13b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 2
pipeline_parallel_degree = 4
sharding_parallel_degree = 1
gradient_accumulation_steps = 256
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage1"
recompute = False
recompute_granularity = "full"
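For the tensor + pipeline variant above, two quick consistency checks are worth spelling out: the parallel degrees should tile the available GPUs exactly, and the larger gradient-accumulation count keeps the global batch size unchanged. A minimal sketch, assuming a single node with 8 GPUs:

```python
# Sketch: consistency checks for the TP2 x PP4 layout above (assumes 8 GPUs).
num_gpus = 8
tensor_parallel_degree = 2
pipeline_parallel_degree = 4
sharding_parallel_degree = 1
per_device_train_batch_size = 4
gradient_accumulation_steps = 256  # also the micro-batch count per pipeline flush

assert (tensor_parallel_degree
        * pipeline_parallel_degree
        * sharding_parallel_degree) == num_gpus

# Replicas left after TP/PP carry the data-parallel dimension (here just 1),
# so the global batch still comes out to 1024 like the other variants.
replicas = num_gpus // (tensor_parallel_degree * pipeline_parallel_degree)
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * replicas
print(global_batch_size)  # 4 * 256 * 1 = 1024
```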
2 changes: 1 addition & 1 deletion training/nvidia/llama1_13B-paddle/config/requirements.txt
85 changes: 0 additions & 85 deletions training/nvidia/llama1_7B-paddle/config/config_A100x1x8.py

This file was deleted.

@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-7b"
tokenizer_name_or_path: str = "facebook/llama-7b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 1
pipeline_parallel_degree = 1
sharding_parallel_degree = 8
gradient_accumulation_steps = 32
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage2"
recompute = False
recompute_granularity = "full"
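Since these config files are plain Python modules with top-level assignments, a benchmark driver can execute one and read its attributes directly. A minimal, hypothetical loader sketch (the helper name and path below are illustrative, not the repository's actual entry point):

```python
import importlib.util

def load_config(path: str) -> dict:
    """Execute a flat, module-level config file and return its public names."""
    spec = importlib.util.spec_from_file_location("benchmark_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {k: v for k, v in vars(module).items() if not k.startswith("_")}

# Path is a placeholder for one of the config files added in this PR.
cfg = load_config("path/to/config_A100x1x8.py")
print(cfg["learning_rate"], cfg["sharding_parallel_degree"])
```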
@@ -0,0 +1,43 @@
# model info
model_name_or_path: str = "facebook/llama-7b"
tokenizer_name_or_path: str = "facebook/llama-7b"
continue_training = 0
split = "998,1,1"
max_seq_length = 2048

# training info
dataloader_num_workers = 1
max_steps = 512
save_steps = 10000
eval_steps = 10000
learning_rate = 3e-4
min_learning_rate = 3e-5
warmup_steps = 2000
weight_decay = 0.1
lr_scheduler_type = "cosine"
adam_beta1 = 0.9
adam_beta2 = 0.95
adam_epsilon = 1e-06
max_grad_norm = 1.0
target_loss = 1.0
target_ppl = 0.6
logging_steps = 1
log_freq = 1
seed = 42

# for parallel
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
tensor_parallel_degree = 2
pipeline_parallel_degree = 1
sharding_parallel_degree = 4
gradient_accumulation_steps = 64
use_flash_attention = 1
fuse_attention_qkv = 0
use_fused_rms_norm = 1
fp16 = True
fp16_opt_level = "O2"
scale_loss = 32768
sharding = "stage1"
recompute = False
recompute_granularity = "full"