This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Details for Wikipedia data formatting #12

Open
ToddMorrill opened this issue Nov 5, 2022 · 2 comments

Comments

@ToddMorrill

I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is: what text does this file include? I'd imagine it includes paragraphs from pages, but do you include section headers? Do you include Wikipedia edit discussion pages, or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Can you share

  1. the data itself
  2. the code you used to generate the single Wikipedia text file
  3. OR some additional details about how you're generating the single Wikipedia text file?

Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.

@heyLinsir

I have similar questions.

As discussed in #6, I tried to prepare en_wiki.txt with the following steps:

  1. wikiextractor enwiki-latest-pages-articles.xml.bz2 --json --output processed (example of output file: wiki_sample.txt)
  2. Extract the value of the "text" field from each JSON record and save it into en_wiki.txt. (example of output file: en_wiki.txt)
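
For anyone reproducing step 2, here is a minimal sketch of the extraction. It is not the repository authors' confirmed pipeline. It assumes WikiExtractor's `--json` output layout (files named like `processed/AA/wiki_00`, one JSON object per line, each with a `text` field). The function names and the `one_paragraph_per_line` switch are hypothetical, and whether the original en_XX.txt holds one paragraph or one article per line is exactly the open question in this thread.

```python
import json
from pathlib import Path


def iter_article_texts(extracted_dir):
    """Yield the "text" field of every JSON record under extracted_dir."""
    # Assumed layout: WikiExtractor --json writes one JSON object per line
    # into files such as processed/AA/wiki_00.
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                text = json.loads(line).get("text", "")
                if text.strip():  # skip records with no body text
                    yield text


def write_corpus(extracted_dir, output_path, one_paragraph_per_line=True):
    """Write article text to output_path; return the number of lines written."""
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for text in iter_article_texts(extracted_dir):
            if one_paragraph_per_line:
                # Assumption: every non-empty paragraph becomes its own line.
                pieces = [p.strip() for p in text.split("\n") if p.strip()]
            else:
                # Alternative: collapse the whole article onto a single line.
                pieces = [" ".join(text.split())]
            for piece in pieces:
                out.write(piece + "\n")
                written += 1
    return written
```

Calling `write_corpus("processed", "en_wiki.txt")` would then produce the single text file. Switching `one_paragraph_per_line` to `False` gives the one-article-per-line variant, which may matter if the tokenizer treats each line as an independent example.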

But I might have missed some steps. Could you please provide detailed instructions on how to generate the single Wikipedia text file?

@hieudx149

Hi @ToddMorrill,
I see that preprocess.py tokenizes its input line by line, but I don't know what each line contains. Does each line correspond to one passage from a Wikipedia page? If you know what the lines contain, please let me know.
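
Until the expected format is confirmed, one way to see what a line-by-line tokenizer would receive from a candidate corpus file is to sample its first few lines and look at their lengths. The helper below is a hypothetical sketch (its name and parameters are not from the repository). Short lines with only a few words would suggest headers or sentences, while lines with hundreds of words would suggest whole passages or articles.

```python
from itertools import islice


def preview_lines(path, limit=5, width=80):
    """Return (word_count, preview) pairs for the first `limit` non-empty lines."""
    previews = []
    with open(path, encoding="utf-8") as handle:
        # Lazily skip blank lines so large files are never fully loaded.
        non_empty = (line.strip() for line in handle if line.strip())
        for line in islice(non_empty, limit):
            previews.append((len(line.split()), line[:width]))
    return previews
```

For example, `preview_lines("en_wiki.txt")` returns a short list of word counts and truncated previews that can be compared against whatever format the maintainers describe.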

3 participants