A DB of Synonyms, Paraphrases, and Hypernyms for all Wiki Things (Articles)
$ git clone https://github.com/infolab-csail/wikithingsdb.git
$ cd wikithingsdb
wikithingsdb$ python setup.py develop # to stay updated on new developments
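If the install worked, the package should import cleanly (assuming the top-level module shares the repository name):
wikithingsdb$ python2.7 -c "import wikithingsdb"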
wikithingsdb$ python2.7 -c "import nltk; nltk.download('punkt');"
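To sanity-check that the punkt models are in place, a quick tokenization should print two sentences:
wikithingsdb$ python2.7 -c "import nltk; print nltk.sent_tokenize('First sentence. Second sentence.')"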
Follow these steps before running wikithingsdb:
- Create the database
$ mysql -u root -e "CREATE DATABASE py_wikipedia;"
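To confirm the database was created:
$ mysql -u root -e "SHOW DATABASES LIKE 'py_wikipedia';"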
- Download the page and redirect dump
$ curl -o enwiki-YYYYMMDD-page.sql.gz https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
$ curl -o enwiki-YYYYMMDD-redirect.sql.gz https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz
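Dumps occasionally arrive truncated, so it is worth comparing each file's checksum against the md5sums file published alongside the dump:
$ md5sum enwiki-YYYYMMDD-page.sql.gz enwiki-YYYYMMDD-redirect.sql.gz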
- Load the page and redirect tables
$ zcat enwiki-YYYYMMDD-page.sql.gz | mysql -u root -D py_wikipedia
$ zcat enwiki-YYYYMMDD-redirect.sql.gz | mysql -u root -D py_wikipedia
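A quick row count is a reasonable sanity check that both imports completed (for a full English dump, the page table should hold tens of millions of rows):
$ mysql -u root -D py_wikipedia -e "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM redirect;"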
- Download the article dump
$ curl -o enwiki-YYYYMMDD-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Run WikiExtractor
$ bzip2 -dc enwiki-YYYYMMDD-pages-articles.xml.bz2 | python /path/to/defexpand/scripts/WikiExtractor.py -l -o extracted
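WikiExtractor normally writes numbered files under lettered subdirectories (extracted/AA/wiki_00, extracted/AA/wiki_01, ...), so a quick listing confirms the extraction produced output:
$ ls extracted/AA | head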
- Merge the output of WikiExtractor
$ ./scripts/merge_extracted.sh /path/to/extracted/ /path/to/output/merged.xml
- Create partitions for WikiThingsDB
$ python scripts/partition.py -f merged.xml -n 10 -o /path/to/output/partitions
On a machine with 8GB of RAM, 10 partitions worked well.
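Assuming partition.py writes one file per partition into the output directory (a guess about its naming, not verified here), the file count should match -n:
$ ls /path/to/output/partitions | wc -l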
First, make sure this line is in your ~/.bashrc: source /data/infolab/misc/elasticstart/elasticstart.env. If it is not, add it and reload your shell.
To create WikiThingsDB, run:
$ wikithingsdb -t 11 /path/to/partitions -l create.log
Roughly, the threads parameter (-t) should be set to the number of cores minus one. You can find how many cores a machine has by running:
$ cat /proc/cpuinfo | grep processor | wc -l
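On most Linux systems, nproc reports the same count more directly:
$ nproc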