To store and access our documents efficiently, we keep them in a sqlite database. The key is the `doc_id` and the value is the `text`.
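As a minimal sketch of this key/value layout, the snippet below builds an in-memory sqlite database and looks up a document by id. The table name `documents`, its column names, and the helper `get_doc_text` are assumptions made for illustration; they are not taken from `build_db.py`.

```python
import sqlite3

# Assumed schema, for illustration only: one table mapping doc id -> text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("doc1", "text of doc1"), ("doc2", "text of doc2")],
)
conn.commit()


def get_doc_text(connection, doc_id):
    """Return the stored text for doc_id, or None if the id is absent."""
    row = connection.execute(
        "SELECT text FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


print(get_doc_text(conn, "doc1"))
```

Making the id the primary key lets sqlite answer lookups through an index instead of scanning every row.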
To create a sqlite db from a corpus of documents, run:
python build_db.py /path/to/data /path/to/saved/db.db
Optional arguments:
--preprocess File path to a python module that defines a `preprocess` function.
--num-workers Number of CPU processes (for tokenizing, etc).
The data path can be either a nested directory of files (such as what the WikiExtractor script outputs) or a single file. Each file should consist of JSON-encoded documents that have `id` and `text` fields, one per line:
{"id": "doc1", "text": "text of doc1"}
...
{"id": "docN", "text": "text of docN"}
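A file in this one-document-per-line format could be produced roughly as follows. The helper names `write_jsonl` and `read_jsonl`, and the output file name `corpus.json`, are hypothetical and only serve the example.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line, keeping only the id and text fields."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps({"id": doc["id"], "text": doc["text"]}) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


docs = [
    {"id": "doc1", "text": "text of doc1"},
    {"id": "doc2", "text": "text of doc2"},
]
path = os.path.join(tempfile.mkdtemp(), "corpus.json")
write_jsonl(docs, path)
```

Because `json.dumps` escapes newlines inside strings, each document stays on a single line even when its text spans several paragraphs.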
The optional `--preprocess /path/to/.py/file` argument lets you supply a python module that defines a `preprocess(doc_object)` function to filter/process documents before they are put in the db. See `prep_wikipedia.py` for an example.
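A preprocessing module might look something like the sketch below. It assumes that `doc_object` is the decoded JSON dictionary for one line and that returning `None` causes the document to be skipped; both are assumptions worth checking against `prep_wikipedia.py`. The `MIN_CHARS` threshold is a hypothetical value chosen for the example.

```python
# Hypothetical threshold: documents shorter than this are dropped.
MIN_CHARS = 20


def preprocess(doc_object):
    """Drop very short documents and collapse runs of whitespace.

    Assumes doc_object is a dict with "id" and "text" keys, and that
    returning None means the document should not be stored.
    """
    text = doc_object.get("text", "").strip()
    if len(text) < MIN_CHARS:
        return None
    return {"id": doc_object["id"], "text": " ".join(text.split())}
```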
To build a TF-IDF weighted word-doc sparse matrix from the documents stored in the sqlite db, run:
python build_tfidf.py /path/to/doc/db /path/to/output/dir
Optional arguments:
--ngram Use up to N-size n-grams (e.g. 2 = unigrams + bigrams). By default only ngrams without stopwords or punctuation are kept.
--hash-size Number of buckets to use for hashing ngrams.
--tokenizer String option specifying tokenizer type to use (e.g. 'corenlp').
--num-workers Number of CPU processes (for tokenizing, etc).
The sparse matrix and its associated metadata will be saved to the output directory under `<db-name>-tfidf-ngram=<N>-hash=<N>-tokenizer=<T>.npz`.
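To illustrate how `--ngram` and `--hash-size` interact, here is a small pure-Python sketch of hashed n-gram TF-IDF. It is not the code in `build_tfidf.py`: the whitespace tokenizer, the `zlib.crc32` hash, the textbook weight `tf * log(N / df)`, and all function names are assumptions picked to keep the example self-contained, and stopword and punctuation filtering is left out.

```python
import math
import zlib
from collections import Counter


def ngrams(tokens, n):
    """All contiguous 1..n-grams of a token list, joined with spaces."""
    return [
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    ]


def bucket(gram, hash_size):
    """Map an n-gram to one of hash_size buckets (crc32 is an assumed hash)."""
    return zlib.crc32(gram.encode("utf-8")) % hash_size


def build_tfidf(docs, n=2, hash_size=2 ** 16):
    """Return {doc_id: {bucket: weight}} with weight = tf * log(N / df)."""
    counts = {
        doc_id: Counter(
            bucket(g, hash_size) for g in ngrams(text.lower().split(), n)
        )
        for doc_id, text in docs.items()
    }
    # Document frequency: in how many documents each bucket occurs.
    doc_freq = Counter(b for c in counts.values() for b in c)
    num_docs = len(docs)
    return {
        doc_id: {b: tf * math.log(num_docs / doc_freq[b]) for b, tf in c.items()}
        for doc_id, c in counts.items()
    }


matrix = build_tfidf({"d1": "question answering", "d2": "question parsing"})
```

Hashing keeps the matrix width fixed at `hash_size` no matter how many distinct n-grams the corpus contains, at the cost of occasional collisions between unrelated n-grams.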
The Document Retriever can also be used interactively (like the full pipeline):
python scripts/retriever/interactive.py --model /path/to/model
>>> process('question answering', k=5)
+------+-------------------------------+-----------+
| Rank |             Doc Id            | Doc Score |
+------+-------------------------------+-----------+
|  1   |       Question answering      |   327.89  |
|  2   |       Watson (computer)       |   217.26  |
|  3   |          Eric Nyberg          |   214.36  |
|  4   |   Social information seeking  |   212.63  |
|  5   | Language Computer Corporation |   184.64  |
+------+-------------------------------+-----------+