When I use a tokenizer, I obtain many patterns that span across document boundaries, which is quite strange. #39
Comments
This deduplicator doesn't know anything about documents. It just knows strings. Do you have a document separator that you use that's not present in any of the documents? (e.g., if you have a tokenizer with <65k tokens you can use \xff\xff\xff\xff as a separator.)
I use \xff\xff as the separator. The tokenizer is GPT-2, with <51k tokens. Is there a big difference between "\xff\xff" and "\xff\xff\xff\xff"? Thanks for the reply.
Huh. If you can be sure that 0xff00 isn't a valid token, then \xff\xff should work, because you should never be able to get that byte sequence out of two adjacent tokens. Do you put a unique counter between documents as well? Otherwise it could match [final bit of document 1][document separator][beginning of document 2] to a document 3/4 if those were in the same positions.
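For illustration, here is a minimal sketch of the separator-plus-unique-counter idea described above, assuming each document is tokenized to uint16 IDs below 0xff00 and serialized as 2-byte little-endian values; the function name and exact layout are hypothetical, not part of this repository:

```python
import struct

def concat_documents(tokenized_docs):
    """Concatenate tokenized documents into one byte stream for deduplication.

    Assumes every token ID is < 0xff00 and is serialized as a 2-byte
    little-endian value, so 0xff never appears as a token's high byte and
    the separator below cannot occur inside a document.
    """
    out = bytearray()
    for i, doc in enumerate(tokenized_docs):
        # Separator bytes that no pair of valid tokens can produce, followed
        # by a unique per-document counter so that
        # [end of doc i][separator][start of doc i+1] never repeats verbatim
        # anywhere else in the corpus.
        out += b"\xff\xff\xff\xff" + struct.pack("<I", i)
        out += b"".join(struct.pack("<H", tok) for tok in doc)
    return bytes(out)
```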
Maybe. I'll try it, thanks.
For example, one pattern's delete offset is [7511, 8038], but the document's (start, end) is [6604, 7516].
Such data accounts for 0.9% of the total.
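As a quick sanity check (my own sketch, not something from this repository), one could flag matches whose byte offsets are not fully contained in a single document; the function name and argument layout are hypothetical:

```python
def crosses_document_boundary(match_start, match_end, doc_ranges):
    """Return True if a duplicate match is not fully contained in one document.

    doc_ranges is a list of (doc_start, doc_end) byte offsets. With the
    example above, (7511, 8038) against a document at (6604, 7516) returns True.
    """
    return not any(start <= match_start and match_end <= end
                   for start, end in doc_ranges)
```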