I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is, what text does this include? I'd imagine it includes paragraphs from pages, but do you include section headers? Do you include Wikipedia edit discussion pages or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Can you share:
- the data itself,
- the code you used to generate the single Wikipedia text file,
- OR some additional details about how you're generating the single Wikipedia text file?
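For reference, here's a minimal sketch of the kind of one-paragraph-per-line flattening I have in mind (this assumes WikiExtractor's --json output; the paths and the output filename are placeholders, and it may well differ from your pipeline):

```python
import glob
import json

# Flatten WikiExtractor's --json output into a single file with one
# paragraph per line. Assumes the dump was already processed with
# something like:
#   python -m wikiextractor.WikiExtractor --json enwiki.xml.bz2 -o extracted/
with open("en_XX.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("extracted/**/wiki_*", recursive=True)):
        with open(path, encoding="utf-8") as f:
            for doc in f:  # one JSON object (one article) per line
                article = json.loads(doc)
                for paragraph in article["text"].split("\n"):
                    paragraph = paragraph.strip()
                    if paragraph:  # drop blank separator lines
                        out.write(paragraph + "\n")
```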
Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.
Hi @ToddMorrill,
I see in preprocess.py that the code tokenizes the input line by line, but I don't know what each line contains. Does each line correspond to one passage from a Wikipedia page? If you know what it contains, please let me know.
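To make my question concrete, here's roughly the shape of the loop I'm picturing (a simplified sketch, not the repo's actual preprocess.py; the tokenize function is a placeholder):

```python
import sys

def tokenize(line):
    # Stand-in tokenizer: the real script presumably applies something
    # like Moses tokenization and/or BPE; a whitespace split is used
    # here purely for illustration.
    return line.split()

# Consume the corpus one line at a time -- my question is what a "line"
# corresponds to here: a whole article, a paragraph, or a sentence?
for line in sys.stdin:
    line = line.strip()
    if line:
        print(" ".join(tokenize(line)))
```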