
[Topic] The impact of data mixing ratios on training #7

Open
wj-Mcat opened this issue Aug 22, 2024 · 8 comments

wj-Mcat commented Aug 22, 2024

[8.22-8.30] I plan to dig into this sub-direction during this period.

@wj-Mcat wj-Mcat self-assigned this Aug 22, 2024
@wj-Mcat wj-Mcat added the Topic Extra attention is needed label Aug 22, 2024

wj-Mcat commented Aug 22, 2024

Reference

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Model Performance

The problem it addresses

Pretraining data comes from many sources, and those sources may reinforce each other, conflict with each other, or simply be unrelated. The questions are: how to evaluate the effect each source has on model quality, how to adjust the mixing ratio between sources so the model's abilities stay balanced across domains, and how to avoid conflicting data while letting related data reinforce each other, so the model's capability is maximized.

I did not read the rest of the paper.

The problem definition and the experiments are not described in much detail; for example, the paper never makes clear how to judge whether two data sources conflict with or reinforce each other.

For this paper, the one key takeaway is that the data mixing ratio matters a lot.
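To make that takeaway concrete, here is a minimal sketch of the workflow this kind of paper points at: run a few small proxy trainings on different mixtures, fit a parametric "mixing law" to their validation losses, and use the fitted law to choose a mixture. The exponential functional form, the three domains, and every number below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical proxy runs: each row is a mixture of (web, code, books)
# proportions summing to 1; losses are made-up validation losses.
mixtures = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
losses = np.array([2.10, 2.02, 1.98, 1.95, 2.05, 1.97])

def mixing_law(r, c, k, t1, t2):
    # L(r) = c + k * exp(t1*r1 + t2*r2); the third coefficient is fixed to 0
    # because the proportions sum to 1, which removes one degree of freedom.
    return c + k * np.exp(t1 * r[:, 0] + t2 * r[:, 1])

params, _ = curve_fit(mixing_law, mixtures, losses,
                      p0=[1.5, 0.5, 0.0, 0.0], maxfev=20000)

# Use the fitted law to pick the mixture with the lowest predicted loss
# from a coarse grid over the probability simplex.
grid = np.array([(a / 10, b / 10, 1 - (a + b) / 10)
                 for a in range(11) for b in range(11 - a)])
best = grid[np.argmin(mixing_law(grid, *params))]
print("best mixture (web, code, books):", best)
```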


wj-Mcat commented Aug 22, 2024

Reference: Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Conclusion: "Unknown" data, i.e. examples whose facts the base model does not already know, damages the LLM's existing capabilities, and the more such data there is, the greater the damage to the model's original abilities.

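Acting on this conclusion requires labeling each fine-tuning example as Known or Unknown relative to the base model. Below is a rough sketch of one way to do that, assuming a simple check of whether any sampled base-model answer contains the gold answer (the paper's actual categorization is more fine-grained); `sample_answer` is a hypothetical hook into whatever inference code you use.

```python
from typing import Callable, Iterable, Tuple

def is_known(question: str, gold_answer: str,
             sample_answer: Callable[[str], str], n_samples: int = 8) -> bool:
    """Label a QA pair 'Known' if the base model already produces the gold
    answer in any of n sampled completions, otherwise 'Unknown'."""
    return any(gold_answer.lower() in sample_answer(question).lower()
               for _ in range(n_samples))

def unknown_fraction(dataset: Iterable[Tuple[str, str]],
                     sample_answer: Callable[[str], str]) -> float:
    """Fraction of the fine-tuning set the base model does not already know;
    per the paper's finding, a large fraction is a warning sign."""
    flags = [not is_known(q, a, sample_answer) for q, a in dataset]
    return sum(flags) / max(len(flags), 1)
```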


wj-Mcat commented Aug 22, 2024

Reference

Doremi: Optimizing data mixtures speeds up language model pretraining


wj-Mcat commented Aug 22, 2024

Reference

An empirical study of catastrophic forgetting in large language models during continual fine-tuning


wj-Mcat commented Aug 22, 2024

Reference

The data distribution is very broad.


An LLM only acquires a given capability when data is deliberately constructed to target it.

It will, of course, show some emergent ability on its own, but with a bit of targeted data guidance in a domain, the capability that emerges becomes considerably stronger.


wj-Mcat commented Aug 26, 2024

Reference: Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining

To study how different categories of training data and their mixing ratios affect training, the authors develop a low-cost data mixing strategy and use it to measure the impact of different mixtures on model quality.

Datasets

Pile and ROOTS: the data distribution and proportions are both fixed manually.
GLaM: also sets the distribution and proportions manually, but discloses few details.
DoReMi [50] and DoGE [15]: propose a training-based approach to optimizing domain proportions, judged by whether the resulting model performs better.

"proposed learning-based methods to optimize domain proportions by iterating between training reference and proxy models"
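For the learning-based line of work quoted above, the core idea is to iterate between a fixed reference model and a small proxy model, upweighting domains where the proxy still lags the reference. The sketch below is a hedged, simplified version of that loop (an exponentiated-gradient reweighting), not the exact DoReMi or DoGE algorithm; the step size, smoothing, and loss numbers are made up.

```python
import numpy as np

def reweight_domains(weights, proxy_losses, reference_losses,
                     step_size=1.0, smoothing=1e-3):
    """One update of per-domain sampling weights.

    Domains where the proxy model's loss still exceeds the reference model's
    loss ("excess loss") get upweighted; a small uniform mix keeps every
    domain represented so none is starved of samples.
    """
    excess = np.maximum(proxy_losses - reference_losses, 0.0)
    new_w = weights * np.exp(step_size * excess)
    new_w /= new_w.sum()
    uniform = np.ones_like(new_w) / len(new_w)
    return (1 - smoothing) * new_w + smoothing * uniform

# Hypothetical 3-domain example: the proxy lags the reference most on domain 1,
# so the update shifts sampling weight toward it.
w = np.array([1 / 3, 1 / 3, 1 / 3])
w = reweight_domains(w,
                     proxy_losses=np.array([2.4, 2.9, 2.1]),
                     reference_losses=np.array([2.3, 2.5, 2.2]))
print(w)
```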


wj-Mcat commented Aug 26, 2024

Datasets for Large Language Models: A Comprehensive Survey
