Dear authors,
I was trying to reimplement Dolma-Web as described in your paper.
However, in Step 2, which uses the dolma toolkit, I found that the Gopher implementation in this repo differs from the original Gopher rules described at http://arxiv.org/abs/2112.11446.
Specifically,
The current code at /python/dolma/taggers.py does not compute 'Duplicate paragraph fraction' or 'Duplicate paragraph character fraction', both of which are listed in Table A1 of the Gopher paper.
Is this a bug, or is there no need to compute these metrics? I look forward to your reply.
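For reference, here is a minimal sketch of how I would expect these two metrics to be computed. It is not taken from the dolma code or from the Gopher paper: the function name `duplicate_paragraph_fractions`, the blank-line paragraph split, and counting only repeats after the first occurrence are all my own assumptions.

```python
import re
from typing import Tuple


def duplicate_paragraph_fractions(text: str) -> Tuple[float, float]:
    """Return (duplicate paragraph fraction, duplicate paragraph character fraction).

    Assumptions (hypothetical, not confirmed by the paper or the repo):
    - paragraphs are separated by one or more blank lines;
    - a paragraph counts as a duplicate if an identical paragraph
      appeared earlier in the same document.
    """
    # Split on blank lines and drop empty or whitespace-only paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
    if not paragraphs:
        return 0.0, 0.0

    seen = set()
    dup_count = 0  # number of paragraphs that repeat an earlier one
    dup_chars = 0  # characters contained in those repeated paragraphs
    for paragraph in paragraphs:
        if paragraph in seen:
            dup_count += 1
            dup_chars += len(paragraph)
        else:
            seen.add(paragraph)

    total_chars = sum(len(p) for p in paragraphs)
    return dup_count / len(paragraphs), dup_chars / total_chars
```

A document would then be filtered when either value exceeds the corresponding threshold in Table A1, in the same way as the line-level duplicate metrics.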
Best regards,
Xinlin Zhuang