
[Topic] The impact of data mixing ratios on training #7

Open
wj-Mcat opened this issue Aug 22, 2024 · 8 comments

wj-Mcat commented Aug 22, 2024

[8.22-8.30] I plan to dig into this sub-direction during this period.

@wj-Mcat wj-Mcat self-assigned this Aug 22, 2024
@wj-Mcat wj-Mcat added the Topic Extra attention is needed label Aug 22, 2024

wj-Mcat commented Aug 22, 2024

Reference

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Model Performance

The problem it addresses

Pretraining data comes from many sources, and those sources may reinforce each other, conflict with each other, or simply be unrelated. The questions are: how to evaluate the effect each source has on model quality, how to adjust the mixing ratio between sources so the model's abilities stay balanced across domains, and how to avoid conflicting data while letting related data reinforce each other, so the model's capability is maximized.

I did not read the rest of the paper.

The problem definition and the experiments are not described in much detail; for example, the paper never makes clear how to judge whether two data sources conflict with or reinforce each other.

For this paper, the one key takeaway is that the data mixing ratio matters a lot.
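To make that takeaway concrete, here is a minimal sketch of the workflow this kind of paper points at: run a few small proxy trainings on different mixtures, fit a parametric "mixing law" to their validation losses, and use the fitted law to choose a mixture. The exponential functional form, the three domains, and every number below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical proxy runs: each row is a mixture of (web, code, books)
# proportions summing to 1; losses are made-up validation losses.
mixtures = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
losses = np.array([2.10, 2.02, 1.98, 1.95, 2.05, 1.97])

def mixing_law(r, c, k, t1, t2):
    # L(r) = c + k * exp(t1*r1 + t2*r2); the third coefficient is fixed to 0
    # because the proportions sum to 1, which removes one degree of freedom.
    return c + k * np.exp(t1 * r[:, 0] + t2 * r[:, 1])

params, _ = curve_fit(mixing_law, mixtures, losses,
                      p0=[1.5, 0.5, 0.0, 0.0], maxfev=20000)

# Use the fitted law to pick the mixture with the lowest predicted loss
# from a coarse grid over the probability simplex.
grid = np.array([(a / 10, b / 10, 1 - (a + b) / 10)
                 for a in range(11) for b in range(11 - a)])
best = grid[np.argmin(mixing_law(grid, *params))]
print("best mixture (web, code, books):", best)
```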


wj-Mcat commented Aug 22, 2024

Reference: Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Conclusion: "Unknown" data, i.e. examples whose facts the base model does not already know, damages the LLM's existing capabilities, and the more such data there is, the greater the damage to the model's original abilities.

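Acting on this conclusion requires labeling each fine-tuning example as Known or Unknown relative to the base model. Below is a rough sketch of one way to do that, assuming a simple check of whether any sampled base-model answer contains the gold answer (the paper's actual categorization is more fine-grained); `sample_answer` is a hypothetical hook into whatever inference code you use.

```python
from typing import Callable, Iterable, Tuple

def is_known(question: str, gold_answer: str,
             sample_answer: Callable[[str], str], n_samples: int = 8) -> bool:
    """Label a QA pair 'Known' if the base model already produces the gold
    answer in any of n sampled completions, otherwise 'Unknown'."""
    return any(gold_answer.lower() in sample_answer(question).lower()
               for _ in range(n_samples))

def unknown_fraction(dataset: Iterable[Tuple[str, str]],
                     sample_answer: Callable[[str], str]) -> float:
    """Fraction of the fine-tuning set the base model does not already know;
    per the paper's finding, a large fraction is a warning sign."""
    flags = [not is_known(q, a, sample_answer) for q, a in dataset]
    return sum(flags) / max(len(flags), 1)
```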


wj-Mcat commented Aug 22, 2024

Reference

Doremi: Optimizing data mixtures speeds up language model pretraining


wj-Mcat commented Aug 22, 2024

Reference

An empirical study of catastrophic forgetting in large language models during continual fine-tuning


wj-Mcat commented Aug 22, 2024

Reference

The data distribution is very broad.


An LLM only acquires a given capability when data is deliberately constructed to target it.

It will, of course, show some emergent ability on its own, but with a bit of targeted data guidance in a domain, the capability that emerges becomes considerably stronger.


wj-Mcat commented Aug 26, 2024

Reference: Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining

To study how different categories of training data and their mixing ratios affect training, the authors develop a low-cost data mixing strategy and use it to measure the impact of different mixtures on model quality.

Datasets

Pile and ROOTS: the data distribution and proportions are both fixed manually.
GLaM: also sets the distribution and proportions manually, but discloses few details.
DoReMi [50] and DoGE [15]: propose a training-based approach to optimizing domain proportions, judged by whether the resulting model performs better.

"proposed learning-based methods to optimize domain proportions by iterating between training reference and proxy models"
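For the learning-based line of work quoted above, the core idea is to iterate between a fixed reference model and a small proxy model, upweighting domains where the proxy still lags the reference. The sketch below is a hedged, simplified version of that loop (an exponentiated-gradient reweighting), not the exact DoReMi or DoGE algorithm; the step size, smoothing, and loss numbers are made up.

```python
import numpy as np

def reweight_domains(weights, proxy_losses, reference_losses,
                     step_size=1.0, smoothing=1e-3):
    """One update of per-domain sampling weights.

    Domains where the proxy model's loss still exceeds the reference model's
    loss ("excess loss") get upweighted; a small uniform mix keeps every
    domain represented so none is starved of samples.
    """
    excess = np.maximum(proxy_losses - reference_losses, 0.0)
    new_w = weights * np.exp(step_size * excess)
    new_w /= new_w.sum()
    uniform = np.ones_like(new_w) / len(new_w)
    return (1 - smoothing) * new_w + smoothing * uniform

# Hypothetical 3-domain example: the proxy lags the reference most on domain 1,
# so the update shifts sampling weight toward it.
w = np.array([1 / 3, 1 / 3, 1 / 3])
w = reweight_domains(w,
                     proxy_losses=np.array([2.4, 2.9, 2.1]),
                     reference_losses=np.array([2.3, 2.5, 2.2]))
print(w)
```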


wj-Mcat commented Aug 26, 2024

Datasets for Large Language Models: A Comprehensive Survey
