A DB of Synonyms, Paraphrases, and Hypernyms for all Wiki Things (Articles)
$ git clone https://github.com/infolab-csail/wikithingsdb.git
$ cd wikithingsdb
wikithingsdb$ python setup.py develop # to stay updated on new developments
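If the install worked, the package should import cleanly (assuming the top-level module shares the repository name):
wikithingsdb$ python2.7 -c "import wikithingsdb"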
wikithingsdb$ python2.7 -c "import nltk; nltk.download('punkt');"
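To sanity-check that the punkt models are in place, a quick tokenization should print two sentences:
wikithingsdb$ python2.7 -c "import nltk; print nltk.sent_tokenize('First sentence. Second sentence.')"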
Follow these steps before running wikithingsdb:
- Create the database
$ mysql -u root -e "CREATE DATABASE py_wikipedia;"
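To confirm the database was created:
$ mysql -u root -e "SHOW DATABASES LIKE 'py_wikipedia';"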
- Download the page and redirect dump
$ curl -o enwiki-YYYYMMDD-page.sql.gz https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
$ curl -o enwiki-YYYYMMDD-redirect.sql.gz https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz
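Dumps occasionally arrive truncated, so it is worth comparing each file's checksum against the md5sums file published alongside the dump:
$ md5sum enwiki-YYYYMMDD-page.sql.gz enwiki-YYYYMMDD-redirect.sql.gz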
- Load the page and redirect tables
$ zcat enwiki-YYYYMMDD-page.sql.gz | mysql -u root -D py_wikipedia
$ zcat enwiki-YYYYMMDD-redirect.sql.gz | mysql -u root -D py_wikipedia
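A quick row count is a reasonable sanity check that both imports completed (for a full English dump, the page table should hold tens of millions of rows):
$ mysql -u root -D py_wikipedia -e "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM redirect;"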
- Download the article dump
$ curl -o enwiki-YYYYMMDD-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Run WikiExtractor
$ bzip2 -dc enwiki-YYYYMMDD-pages-articles.xml.bz2 | python /path/to/defexpand/scripts/WikiExtractor.py -l -o extracted
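WikiExtractor normally writes numbered files under lettered subdirectories (extracted/AA/wiki_00, extracted/AA/wiki_01, ...), so a quick listing confirms the extraction produced output:
$ ls extracted/AA | head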
- Merge the output of WikiExtractor
$ ./scripts/merge_extracted.sh /path/to/extracted/ /path/to/output/merged.xml
- Create partitions for WikiThingsDB
$ python scripts/partition.py -f merged.xml -n 10 -o /path/to/output/partitions
On a machine with 8GB of RAM, 10 partitions worked well.
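Assuming partition.py writes one file per partition into the output directory (a guess about its naming, not verified here), the file count should match -n:
$ ls /path/to/output/partitions | wc -l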
First, make sure this line is in your ~/.bashrc: source /data/infolab/misc/elasticstart/elasticstart.env. If it is not, add it and reload your shell.
To create WikiThingsDB, run:
$ wikithingsdb -t 11 /path/to/partitions -l create.log
Roughly, the threads parameter (-t) should be set to the number of cores minus one. You can find how many cores a machine has by running:
$ cat /proc/cpuinfo | grep processor | wc -l
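On most Linux systems, nproc reports the same count more directly:
$ nproc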