Scrape Fandom

Fandom.com provides Wiki dumps at https://*.fandom.com/wiki/Special:Statistics, but most of these dumps are outdated, and producing a new one requires contacting an admin.

This script scrapes Fandom.com for an updated Wiki dump. It scrapes Special:AllPages to get a list of article names, then requests a wiki dump for those articles from Special:Export. Instructions for turning the dump into a corpus for natural language processing and training are provided.
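The two-step flow described above can be sketched as follows. This is a minimal illustration and not the script's own code: it only builds the URLs and form fields and makes no network request. The helper names (`allpages_url`, `export_url`, `export_form`) are hypothetical. The fields `pages`, `curonly` and `action` are standard MediaWiki Special:Export parameters, but whether a given Fandom wiki accepts them outside a browser session is an assumption.

```python
from urllib.parse import urlencode


def allpages_url(wiki: str, start: str = "") -> str:
    """Return the Special:AllPages listing URL, optionally starting at a title."""
    base = f"https://{wiki}.fandom.com/wiki/Special:AllPages"
    # 'from' is the MediaWiki query parameter that sets the first listed title.
    return f"{base}?{urlencode({'from': start})}" if start else base


def export_url(wiki: str) -> str:
    """Return the Special:Export URL for an English Fandom wiki."""
    return f"https://{wiki}.fandom.com/wiki/Special:Export"


def export_form(titles: list[str]) -> dict[str, str]:
    """Build Special:Export form fields: one title per line, current revisions only."""
    return {
        "pages": "\n".join(titles),
        "curonly": "1",
        "action": "submit",
    }


if __name__ == "__main__":
    print(allpages_url("harrypotter"))
    print(export_url("harrypotter"))
    print(export_form(["Harry Potter", "Hermione Granger"]))
```

Article titles collected from the AllPages listing would be passed to `export_form`, and the resulting fields submitted to the URL from `export_url`.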

The script works only for English-language Fandom sites. Other languages need some slight modifications.
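One modification that other languages would probably need is a different base URL. The sketch below assumes that non-English Fandom wikis are served under a language prefix in the path (for example `/de/wiki/`). The helper name `wiki_base` is hypothetical and is not part of the script, and other changes may also be required.

```python
def wiki_base(wiki: str, lang: str = "en") -> str:
    """Return the base URL for wiki pages, adding a language prefix when needed."""
    root = f"https://{wiki}.fandom.com"
    # Assumption: English wikis have no prefix, other languages use /<lang>/wiki.
    return f"{root}/wiki" if lang == "en" else f"{root}/{lang}/wiki"


if __name__ == "__main__":
    print(wiki_base("harrypotter"))
    print(wiki_base("harrypotter", "de"))
```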

Notes

The Chrome browser must be installed on the machine. webdriver-manager takes care of fetching the most up-to-date Chrome Driver. The requirements.txt file lists the Python libraries that the scripts depend on; install them with:

pip install -r requirements.txt
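For orientation, this is roughly how a Chrome session is usually started with webdriver-manager. It is a sketch that assumes Selenium 4 is among the installed requirements; the function names `chrome_arguments` and `make_driver` are hypothetical, and the script itself may configure Chrome differently.

```python
def chrome_arguments(headless: bool = True) -> list[str]:
    """Return the command-line switches passed to Chrome."""
    args = ["--window-size=1280,1024"]
    if headless:
        # Run Chrome without a visible window.
        args.append("--headless=new")
    return args


def make_driver(headless: bool = True):
    """Start Chrome, letting webdriver-manager download a matching driver."""
    # Imported lazily so this file loads even before the requirements are installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling `make_driver()` requires Chrome and network access for the driver download; call `quit()` on the returned driver when finished.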

Instructions

Clone the extractor locally: git clone https://github.com/JOHW85/wikiextractor
In a terminal, change into the repository directory: cd wikiextractor
Install the extractor: python3 setup.py install
Finally, run run-me.sh FANDOM1 FANDOM2 in the terminal, where each argument is the name of a Fandom wiki, to get FANDOM1.jsonl and FANDOM2.jsonl in the directory.

Example: run-me.sh harrypotter finalfantasy
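Once a .jsonl file exists, it can be read back as a corpus one article at a time. The sketch below assumes each line is a JSON object holding the article body under a `text` key, as in the usual wikiextractor JSON output; the key name and the helper `iter_texts` are assumptions, so check them against a line of the real output first.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: str, field: str = "text") -> Iterator[str]:
    """Yield the non-empty article texts from a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record.get(field, "")
            if text:
                yield text


if __name__ == "__main__":
    # Count the articles in the file produced by the example run above.
    print(sum(1 for _ in iter_texts("harrypotter.jsonl")))
```

Reading lazily with a generator keeps memory use flat even for large wikis.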
