Commit

update data_utils
carpedm20 committed Jan 23, 2016
1 parent 15c5e7b commit 440964f
Showing 4 changed files with 29 additions and 17 deletions.
7 changes: 6 additions & 1 deletion README.md
@@ -14,7 +14,12 @@ Prerequisites
Usage
-----

First, you need to download [DeepMind Q&A Dataset](https://github.com/deepmind/rc-data) from [here](https://github.com/deepmind/rc-data) or [here](http://cs.nyu.edu/~kcho/DMQA/).
First, you need to download [DeepMind Q&A Dataset](https://github.com/deepmind/rc-data) from [here](http://cs.nyu.edu/~kcho/DMQA/), save `cnn.tgz` and `dailymail.tgz` into the repo, and run:

$ ./unzip.sh cnn.tgz dailymail.tgz

Then run the pre-processing code with:
$ python data_utils.py data cnn

To train a model with `cnn` dataset:

15 changes: 13 additions & 2 deletions data_utils.py
@@ -20,9 +20,10 @@
from __future__ import division
from __future__ import print_function

import gzip
import os
import re
import sys
import gzip
import tarfile
from tqdm import *
from glob import glob
@@ -265,4 +266,14 @@ def prepare_data(data_dir, dataset, vocab_size):
questions_to_token_ids(train_path, vocab_fname, vocab_size)

if __name__ == '__main__':
prepare_data('data', 'cnn', 1000000)
  if len(sys.argv) < 3:
    print(" [*] usage: python data_utils.py DATA_DIR DATASET_NAME [VOCAB_SIZE]")
  else:
    data_dir = sys.argv[1]
    dataset_name = sys.argv[2]
    if len(sys.argv) > 3:
      # Command-line arguments arrive as strings, so convert before use.
      vocab_size = int(sys.argv[3])
    else:
      vocab_size = 100000

    prepare_data(data_dir, dataset_name, vocab_size)
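
The argument handling in this hunk could also be written with the standard `argparse` module, which generates the usage message and converts `VOCAB_SIZE` to an integer. This is only a sketch and not part of the commit: `parse_args` and `DEFAULT_VOCAB_SIZE` are hypothetical names, and the default of 100000 is taken from the fallback value in the hunk.

```python
import argparse

# Hypothetical constant; 100000 mirrors the fallback used in the hunk above.
DEFAULT_VOCAB_SIZE = 100000


def parse_args(argv):
    """Parse DATA_DIR, DATASET_NAME and an optional integer VOCAB_SIZE."""
    parser = argparse.ArgumentParser(
        description="Pre-process a dataset for training.")
    parser.add_argument("data_dir", help="directory that holds the data")
    parser.add_argument("dataset_name", help="dataset to process, e.g. cnn")
    # nargs="?" makes the positional optional; type=int converts the string.
    parser.add_argument("vocab_size", nargs="?", type=int,
                        default=DEFAULT_VOCAB_SIZE,
                        help="maximum vocabulary size")
    return parser.parse_args(argv)


# Example with an explicit argument list instead of sys.argv[1:].
args = parse_args(["data", "cnn"])
print(args.data_dir, args.dataset_name, args.vocab_size)
```

In a script, the parsed values would then be passed on, for instance as `prepare_data(args.data_dir, args.dataset_name, args.vocab_size)`.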
2 changes: 1 addition & 1 deletion main.py
@@ -14,7 +14,7 @@
flags.DEFINE_float("learning_rate", 0.0002, "Learning rate for adam [0.0002]")
flags.DEFINE_string("model", "LSTM", "The type of model to train and test [LSTM, Attentive, Impatient]")
flags.DEFINE_string("data_dir", "data", "The name of data directory [data]")
flags.DEFINE_string("dataset", "cnn", "The name of dataset [cnn, dailymail]")
flags.DEFINE_string("dataset", "small", "The name of dataset [cnn, dailymail]")
flags.DEFINE_string("checkpoint_dir", "checkpoint", "Directory name to save the checkpoints [checkpoint]")
flags.DEFINE_boolean("forward_only", False, "True for forward only, False for training [False]")
FLAGS = flags.FLAGS
22 changes: 9 additions & 13 deletions unzip.sh
@@ -4,16 +4,12 @@ if [ ! -d ./data ]; then
mkdir -p ./data
fi

echo "Unzip cnn.tgz..."
if [ type "pigz" &> /dev/null ]; then
tar -xvf -C data/ | pigz > cnn.tgz
else
tar -xzvf cnn.tgz -C data/
fi

echo "Unzip cnn.tgz..."
if [ type "pigz" &> /dev/null ]; then
tar -xvf -C data/ | pigz > dailymail.tgz
else
tar -xzvf dailymail.tgz -C data/
fi
for file in "$@"; do
  if which pigz > /dev/null; then
    echo "Unzip $file with pigz..."
    tar -I pigz -xvf "$file" -C data/
  else
    echo "Unzip $file..."
    tar -xvf "$file" -C data/
  fi
done
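
For systems without a compatible `tar`, the same extraction loop could be approximated with Python's standard `tarfile` module, which `data_utils.py` already imports. This is a rough sketch assuming Python 3 and trusted archives, not the commit's code: `extract_archives` is a hypothetical helper, and it does not use `pigz`.

```python
import os
import tarfile


def extract_archives(archive_paths, dest_dir="data"):
    """Extract each tar archive into dest_dir, creating it if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for path in archive_paths:
        print("Unzip %s..." % path)
        # Mode "r:*" lets tarfile detect the compression (gzip for .tgz).
        with tarfile.open(path, "r:*") as archive:
            # Only use extractall on archives from a trusted source.
            archive.extractall(dest_dir)


# Usage (assumes the archives were downloaded into the repo):
# extract_archives(["cnn.tgz", "dailymail.tgz"])
```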
