When I use a tokenizer, I obtain many patterns that span across document boundaries, which is quite strange. #39
Comments
This deduplicator doesn't know anything about documents. It just knows strings. Do you have a document separator that you use that's not present in any of the documents? (e.g., if you have a tokenizer with <65k tokens you can use \xff\xff\xff\xff as a separator.)
I use \xff\xff as the separator. The tokenizer is GPT-2, with <51k tokens. Is there a big difference between "\xff\xff" and "\xff\xff\xff\xff"? Thanks for the reply.
Huh. If you can be sure that 0xff00 isn't a valid token, then \xff\xff should work, because you should never be able to get that byte sequence out of two adjacent tokens. Do you put a unique counter between documents as well? Otherwise it could match [final bit of document 1][document separator][beginning of document 2] to a document 3/4 if those were in the same positions.
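For illustration, here is a minimal sketch of the separator-plus-unique-counter idea described above, assuming each document is tokenized to uint16 IDs below 0xff00 and serialized as 2-byte little-endian values; the function name and exact layout are hypothetical, not part of this repository:

```python
import struct

def concat_documents(tokenized_docs):
    """Concatenate tokenized documents into one byte stream for deduplication.

    Assumes every token ID is < 0xff00 and is serialized as a 2-byte
    little-endian value, so 0xff never appears as a token's high byte and
    the separator below cannot occur inside a document.
    """
    out = bytearray()
    for i, doc in enumerate(tokenized_docs):
        # Separator bytes that no pair of valid tokens can produce, followed
        # by a unique per-document counter so that
        # [end of doc i][separator][start of doc i+1] never repeats verbatim
        # anywhere else in the corpus.
        out += b"\xff\xff\xff\xff" + struct.pack("<I", i)
        out += b"".join(struct.pack("<H", tok) for tok in doc)
    return bytes(out)
```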
Maybe. I'll try it, thanks.
For example, one pattern's delete offset is [7511, 8038], but the document's (start, end) is [6604, 7516].
Such data accounts for 0.9% of the total.
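As a quick sanity check (my own sketch, not something from this repository), one could flag matches whose byte offsets are not fully contained in a single document; the function name and argument layout are hypothetical:

```python
def crosses_document_boundary(match_start, match_end, doc_ranges):
    """Return True if a duplicate match is not fully contained in one document.

    doc_ranges is a list of (doc_start, doc_end) byte offsets. With the
    example above, (7511, 8038) against a document at (6604, 7516) returns True.
    """
    return not any(start <= match_start and match_end <= end
                   for start, end in doc_ranges)
```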