
Commit d4f701c

Moved all data to a parent directory
1 parent: 9902634

File tree

5 files changed (+29, -10 lines)

.gitignore
README.md
clean.py
smashwords.py
train.py


.gitignore

Lines changed: 1 addition & 2 deletions
@@ -12,7 +12,6 @@ Icon?
 
 # Skip Thoughts
 word2vecModel*
-books/
-books_tf/
+data/
 output/
 

README.md

Lines changed: 21 additions & 3 deletions
@@ -9,24 +9,32 @@ This code is written for python 3.6. To download, clone this repository:
 git clone https://github.com/danielwatson6/skip-thoughts.git
 ```
 
+### Obtaining the training data
+
 To obtain the training data, navigate to [https://www.smashwords.com/] and navigate the website to restrict the books to obtain to the desired categories (e.g. only free books of >=20,000 word length, of a certain genre, etc.). The resulting URL in the browser with the paginated list of books can be passed to this script to download all books in English that are available in text file format:
 ```bash
-python smashwords.py [URL] [SAVE_DIRECTORY (defaults to ./books)]
+python smashwords.py [URL] [SAVE_DIRECTORY (defaults to data/books)]
 # Example: python smashwords.py https://www.smashwords.com/books/category/1/newest/0/free/medium
 ```
 
 We use Google's pre-trained 300-dimensional word vectors, which can be downloaded [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (this link just mirrors the download link of the [official website](https://code.google.com/archive/p/word2vec/)). The general model is of course independent of what word vectors are fed.
 
+### Preprocessing
+
 To clean the training data, there is a script provided that will textually normalize (sentences are extracted and only alphanumerics and apostrophes are kept) all the files in the input directory, convert the words to unique integer IDs according to the provided word vector model, and save them in the TensorFlow binary format. See the help page for further details:
 ```bash
 python clean.py --help
 ```
 
+### Training the model
+
 To train the model, change hyperparameters, or get sentence embeddings, see the help page of the training script:
 ```bash
 python train.py --help
 ```
 
+### Running a trained model
+
 To use the model in any python script, follow this basic pattern:
 ```python
 import tensorflow as tf
@@ -43,13 +51,23 @@ with graph.as_default():
   model = SkipThoughts(word2vec_model, **kwargs)
 
 with tf.Session(graph=graph):
-  # Restore the model only once:
-  model.restore(save_dir)  # pass in the directory where the .ckpt files live.
+  # Restore the model only once.
+  # Here, `save_dir` is the directory where the .ckpt files live. Typically
+  # this would be "output/mymodel" where --model_name=mymodel in train.py.
+  model.restore(save_dir)
 
   # Run the model like this as many times as desired.
   print(model.encode(sentence_strings))
 ```
 
+### Evaluating a trained model
+
+We provide an evaluation script to test the quality of the sentence vectors
+produced by the trained model.
+
+
+
+
 ## Dependencies
 
 All the dependencies are listed in the `requirements.txt` file. They can be installed with `pip` as follows:
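One usage note beyond the diff itself: the README pattern above leaves `word2vec_model` and `save_dir` undefined. The following is a minimal sketch of how they could be filled in, assuming gensim is used to load the downloaded word vectors and that the import path `skip_thoughts` and the checkpoint directory `output/mymodel` are placeholders; the repository's actual module name, file formats, and constructor hyperparameters may differ.

```python
import os

import tensorflow as tf
from gensim.models import KeyedVectors

# Hypothetical import path; the class name comes from the README pattern.
from skip_thoughts import SkipThoughts

# Assumption: the pre-trained vectors sit at the scripts' default path "word2vecModel"
# and are in binary word2vec format.
word2vec_model = KeyedVectors.load_word2vec_format("word2vecModel", binary=True)

# Matches the README comment: checkpoints live in output/[model_name].
save_dir = os.path.join("output", "mymodel")

graph = tf.Graph()
with graph.as_default():
    # Pass the same hyperparameters used during training (the README's **kwargs).
    model = SkipThoughts(word2vec_model)

with tf.Session(graph=graph):
    model.restore(save_dir)
    print(model.encode(["A sentence to embed.", "Another sentence."]))
```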

clean.py

Lines changed: 2 additions & 2 deletions
@@ -21,9 +21,9 @@
                     help="Keep only the n most common words of the training data.")
 parser.add_argument('--max_length', type=int, default=40,
                     help="Truncate input and output sentences to maximum length n.")
-parser.add_argument('--input', type=str, default="books",
+parser.add_argument('--input', type=str, default="data/books",
                     help="Path to the directory containing the text files.")
-parser.add_argument('--output', type=str, default="books_tf",
+parser.add_argument('--output', type=str, default="data/books_tf",
                     help="Path to the directory that will contain the TFRecord files.")
 parser.add_argument('--embeddings_path', type=str, default="./word2vecModel",
                     help="Path to the pre-trained word embeddings model.")

smashwords.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 """Book scraping script for smashwords.com.
 
-Usage: python smashwords.py [scrape_link] [output_dir (defaults to ./books)]
+Usage: python smashwords.py [scrape_link] [output_dir (defaults to data/books)]
 """
 
 import os
@@ -25,7 +25,7 @@ def to_filename(s):
 
 if __name__ == '__main__':
 
-  write_dir = 'books'
+  write_dir = 'data/books'
   if len(sys.argv) > 2:
     write_dir = sys.argv[2]
 

train.py

Lines changed: 3 additions & 1 deletion
@@ -1,3 +1,5 @@
+"""Script for training the skip-thoughts model."""
+
 import argparse
 import itertools
 import os
@@ -50,7 +52,7 @@
 # Configuration args
 parser.add_argument('--embeddings_path', type=str, default="word2vecModel",
                     help="Path to the pre-trained word embeddings model.")
-parser.add_argument('--input', type=str, default="books_tf",
+parser.add_argument('--input', type=str, default="data/books_tf",
                     help="Path to the directory containing the dataset TFRecord files.")
 parser.add_argument('--model_name', type=str, default="default",
                     help="Will save/restore model in ./output/[model_name].")
