Two questions after incremental pretraining: 1) long-tail tokens 2) group texts #83

### Describe the Question

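The domain text for the continued (second-stage) pretraining comes from game applications.
The model is chatglm-6b, the data is 40,000 QA samples, and the training setup is LoRA with an autoregressive objective.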
 
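I am currently running into two problems:
1) After incremental pretraining, the output shows a "long-tail" token pattern: the model keeps emitting the same token over and over and never stops on its own.
      For example: 《传奇》 is an MMO game, **全新玩法 全新玩法 全新玩法 全新玩法 ...** (the phrase "brand-new gameplay" repeated indefinitely)

2) The input data has the form "question + answer". Because I applied group texts, the trained model's answers also contain a "question".
      For example:
      Q: Could you introduce the game 《王者荣耀》?
      A: 《王者荣耀》 is a ...... game. **Please describe the game 《和平精英》? 《和平精英》 is a .....**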
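
To make the second problem concrete, here is a minimal sketch of what a typical group-texts step does, assuming the QA samples are concatenated with no separator between them. Everything in it is hypothetical and for illustration only: `group_texts` is a plain-Python stand-in rather than the actual training script, the token ids are toy values, and `EOS_ID = 2` is a placeholder (the real end-of-sequence id depends on the tokenizer). It shows that without a separator one training block can hold two QA pairs back to back, which may be related to the answer running on into the next question, and that appending an end-of-sequence id to each sample gives the model an explicit boundary to learn. Whether this also explains the non-stopping repetition in problem 1 is only a guess.

```python
from itertools import chain

EOS_ID = 2  # hypothetical end-of-sequence token id; the real id depends on the tokenizer


def group_texts(samples, block_size, eos_id=None):
    """Concatenate tokenized samples and split them into fixed-size blocks.

    If eos_id is given, it is appended to every sample before concatenation,
    so each sample ends with an explicit boundary token.
    """
    if eos_id is not None:
        samples = [ids + [eos_id] for ids in samples]
    flat = list(chain.from_iterable(samples))
    # Drop the tail that does not fill a whole block.
    total = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, total, block_size)]


# Toy token ids standing in for two tokenized "question + answer" samples.
qa_1 = [11, 12, 13]
qa_2 = [21, 22, 23]

# No separator: both QA pairs end up fused in a single training block.
without_eos = group_texts([qa_1, qa_2], block_size=6)

# With a separator: every sample is followed by the end-of-sequence id.
with_eos = group_texts([qa_1, qa_2], block_size=4, eos_id=EOS_ID)

print(without_eos)
print(with_eos)
```

In the first call nothing marks where the first answer ends, so a model trained on that block sees "answer, then next question" as ordinary continuation.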
