Need clarification of Gopher in Step 2 #172

Open
@mihara-bot

Dear authors,
I am trying to reimplement the Dolma-Web pipeline described in your paper.
However, in Step 2, when using the dolma toolkit, I found that the Gopher implementation in this repo differs from the original Gopher rules described at http://arxiv.org/abs/2112.11446.
Specifically, the current code at /python/dolma/taggers.py does not compute the 'Duplicate paragraph fraction' or the 'Duplicate paragraph character fraction', both of which are listed in Table A1 of the Gopher paper.
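
For concreteness, the two metrics in question could be computed roughly as below. This is only a minimal sketch, not code from the dolma repo or from the Gopher paper: the function name `duplicate_paragraph_fractions` is hypothetical, paragraphs are assumed to be separated by blank lines, and every occurrence of a paragraph after its first is counted as a duplicate. The paper's exact counting convention may differ.

```python
import re
from collections import Counter


def duplicate_paragraph_fractions(text):
    """Return (duplicate paragraph fraction, duplicate paragraph character fraction).

    Assumptions (not taken from the dolma code base):
    - paragraphs are separated by one or more blank lines;
    - each repeat of a paragraph beyond its first occurrence is a duplicate.
    """
    # Split on blank lines and drop empty or whitespace-only paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0, 0.0

    counts = Counter(paragraphs)

    # Number of paragraphs that repeat an earlier paragraph.
    duplicate_paragraphs = sum(c - 1 for c in counts.values())
    # Number of characters contained in those repeated paragraphs.
    duplicate_chars = sum(len(p) * (c - 1) for p, c in counts.items())
    total_chars = sum(len(p) for p in paragraphs)

    return duplicate_paragraphs / len(paragraphs), duplicate_chars / total_chars


if __name__ == "__main__":
    sample = "aaaa\n\nbb\n\naaaa\n\ncc"
    para_frac, char_frac = duplicate_paragraph_fractions(sample)
    print(f"duplicate paragraph fraction: {para_frac:.3f}")
    print(f"duplicate paragraph character fraction: {char_frac:.3f}")
```

In the sample, one of the four paragraphs repeats an earlier one, and it accounts for 4 of the 12 paragraph characters.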

Is this a bug, or is there no need to compute these metrics? I look forward to your reply.

Best regards,
Xinlin Zhuang
