Details for Wikipedia data formatting #12

ToddMorrill · 2022-11-05T18:04:57Z

I'm looking at tokenization_script.sh and I see that you're loading in en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is, what text does this include? I'd imagine it includes paragraphs from pages but do you include section headers? Do you include Wikipedia edit discussion pages or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here) but I want to follow your data preparation as closely as possible. Can you share

1. the data itself
2. the code you used to generate the single Wikipedia text file
3. OR some additional details about how you're generating the single Wikipedia text file?

Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.

heyLinsir · 2022-11-06T07:01:33Z

I have similar questions.

As discussed in #6, I tried to prepare en_wiki.txt with the following steps:

1. Run wikiextractor enwiki-latest-pages-articles.xml.bz2 --json --output processed (example of output file: wiki_sample.txt)
2. Extract the value of text from each JSON item and save it into en_wiki.txt. (example of output file: en_wiki.txt)

But I might have missed some steps. Could you please provide detailed instructions on how to generate the single Wikipedia text file?
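
For reference, step 2 above can be sketched roughly as follows. This is only a minimal sketch under several assumptions, not the repo authors' actual preprocessing: it assumes wikiextractor's --json output is laid out as shard files such as processed/AA/wiki_00 with one JSON object per line, and that each article should end up on a single line of en_wiki.txt. The function name extract_text and the one-article-per-line choice are hypothetical.

```python
import json
from pathlib import Path


def extract_text(processed_dir, out_path):
    """Write the "text" field of every JSON record to out_path.

    Returns the number of non-empty articles written.
    """
    n_written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumption: shards live one directory deep, e.g. processed/AA/wiki_00.
        for shard in sorted(Path(processed_dir).glob("*/wiki_*")):
            with open(shard, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    text = json.loads(line).get("text", "")
                    # Assumption: collapse all whitespace (including newlines)
                    # so that each article occupies exactly one output line.
                    text = " ".join(text.split())
                    if text:  # skip records whose text is empty
                        out.write(text + "\n")
                        n_written += 1
    return n_written
```

Calling extract_text("processed", "en_wiki.txt") would then produce the single text file. Whether the original en_XX.txt keeps one article, one paragraph, or one sentence per line is exactly the open question in this issue, so the whitespace-collapsing step may need to change.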

hieudx149 · 2023-03-29T10:10:03Z

Hi @ToddMorrill,
I see that the code in preprocess.py tokenizes the input line by line, but I don't know what each line contains. Does each line hold one passage from a Wikipedia page? If you know what the lines contain, please let me know.
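
Until the format is documented, one way to check what a line-oriented text file holds is to print its first few lines. This is a small, hypothetical helper (peek_lines is not part of the repo), and it only tells you about whatever file you point it at, for example a self-generated en_wiki.txt:

```python
from itertools import islice


def peek_lines(path, n=3, width=80):
    """Return the first n lines of a text file, each truncated to width characters."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so a very large file is never read in full.
        return [line.rstrip("\n")[:width] for line in islice(f, n)]
```

If each returned line reads like a whole article, the file is one article per line; if it reads like a single paragraph or sentence, the file was split more finely.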
