Open
Labels
enhancement: New feature or request
Description
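Feature Description
Currently, MinHash deduplication can only process the data inside a single JSONL file; deduplicating across multiple JSONL files may be fairly difficult to implement. A while ago I used Hugging Face's datatrove to deduplicate across JSONL files, and it could serve as a reference.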
https://github.com/huggingface/datatrove
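To make the request concrete, here is a minimal sketch of what cross-JSONL MinHash deduplication could look like. It uses only the Python standard library and is neither datatrove's API nor DataFlow's existing operator. The function name `dedup_across_jsonl`, the `text` field name, and all parameter defaults (64 hash functions, 16 bands, 0.8 threshold, word 5-grams) are assumptions chosen for illustration. The key point is that the LSH buckets are shared across all input files, so a record in one file can be matched against records seen in earlier files.

```python
import hashlib
import json
import random

_PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def _shingles(text, n=5):
    """Return the set of word n-grams (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _base_hash(shingle):
    """Deterministic 64-bit hash of a shingle."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class MinHasher:
    """Computes MinHash signatures using seeded (a*h + b) mod p hash functions."""

    def __init__(self, num_perm=64, seed=0):
        rng = random.Random(seed)
        self.params = [
            (rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_perm)
        ]

    def signature(self, text, n=5):
        hashes = [_base_hash(s) for s in _shingles(text, n)]
        return tuple(min((a * h + b) % _PRIME for h in hashes) for a, b in self.params)


def dedup_across_jsonl(paths, text_key="text", num_perm=64, bands=16,
                       threshold=0.8, ngram=5):
    """Return records from all files, keeping only the first of each near-duplicate group.

    Hypothetical helper: one LSH index is shared by every file in `paths`.
    """
    if num_perm % bands != 0:
        raise ValueError("num_perm must be divisible by bands")
    rows = num_perm // bands
    hasher = MinHasher(num_perm)
    buckets = {}      # (band index, band values) -> indices into `signatures`
    signatures = []   # signatures of the records kept so far
    kept = []

    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                record = json.loads(line)
                sig = hasher.signature(record[text_key], ngram)
                keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
                # Candidates share at least one band with an already kept record.
                candidates = {i for k in keys for i in buckets.get(k, ())}
                # Fraction of matching signature slots estimates Jaccard similarity.
                is_duplicate = any(
                    sum(x == y for x, y in zip(sig, signatures[i])) / num_perm >= threshold
                    for i in candidates
                )
                if is_duplicate:
                    continue
                signatures.append(sig)
                for k in keys:
                    buckets.setdefault(k, []).append(len(signatures) - 1)
                kept.append(record)
    return kept
```

Calling `dedup_across_jsonl(["a.jsonl", "b.jsonl"])` (placeholder file names) returns the surviving records in input order. This sketch keeps every signature in memory, so it only suits modest data sizes; a disk-backed, multi-stage pipeline such as the one datatrove provides is the more realistic reference for large corpora.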
System Info (dataflow env)
platform: linux
Additional Information
No response