To obtain the training data, go to [https://www.smashwords.com/](https://www.smashwords.com/) and use the site's filters to restrict the catalog to the desired books (e.g. only free books of at least 20,000 words, of a certain genre, etc.). The resulting URL in the browser, with the paginated list of books, can then be passed to this script to download all books that are in English and available in plain-text format:
```bash
python smashwords.py [URL] [SAVE_DIRECTORY (defaults to data/books)]
```
We use Google's pre-trained 300-dimensional word vectors, which can be downloaded [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (this link just mirrors the download link of the [official website](https://code.google.com/archive/p/word2vec/)). The model itself is of course independent of which word vectors it is fed.
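Conceptually, once loaded, the pre-trained vectors are just a lookup table from integer word IDs to rows of a `[vocab_size, 300]` matrix. A toy sketch with random stand-in vectors (the names here are illustrative, not this repo's API):

```python
import numpy as np

# Random stand-in for the pre-trained [vocab_size, 300] embedding matrix;
# in practice the rows would come from the downloaded word2vec binary.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((5, 300))
word_to_id = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def embed(words):
    """Return the embedding rows for a list of known words."""
    return vectors[[word_to_id[w] for w in words]]

print(embed(["the", "cat"]).shape)  # (2, 300)
```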
### Preprocessing
To clean the training data, a script is provided that textually normalizes all the files in the input directory (sentences are extracted and only alphanumerics and apostrophes are kept), converts the words to unique integer IDs according to the provided word-vector model, and saves them in the TensorFlow binary format. See the help page for further details:
```bash
python clean.py --help
```
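The normalization step described above can be sketched in a few lines. This is a toy illustration, not `clean.py`'s actual implementation; the vocabulary here stands in for the word-vector model's vocab, and the real tokenization may differ:

```python
import re

# Toy vocabulary mapping words to integer IDs, with an unknown-word ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "don't": 4}

def normalize(sentence):
    """Keep only alphanumerics, apostrophes, and spaces; lowercase; split."""
    return re.sub(r"[^A-Za-z0-9' ]+", " ", sentence).lower().split()

def to_ids(words, vocab):
    """Map each word to its integer ID, falling back to <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(to_ids(normalize("The cat -- sat!"), vocab))  # [1, 2, 3]
```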
### Training the model
To train the model, change hyperparameters, or get sentence embeddings, see the help page of the training script:
```bash
python train.py --help
```
### Running a trained model
To use the model in any Python script, follow this basic pattern:
```python
import tensorflow as tf

# ... (import of SkipThoughts, word-vector loading, and creation of
# `graph` elided) ...

with graph.as_default():
    model = SkipThoughts(word2vec_model, **kwargs)

with tf.Session(graph=graph):
    # Restore the model only once.
    # Here, `save_dir` is the directory where the .ckpt files live. Typically
    # this would be "output/mymodel" where --model_name=mymodel in train.py.
    model.restore(save_dir)

    # Run the model like this as many times as desired.
    print(model.encode(sentence_strings))
```
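`model.encode` produces one vector per input sentence, and a common downstream use of such sentence vectors is nearest-neighbor retrieval by cosine similarity. A self-contained sketch, with toy low-dimensional embeddings standing in for the encoder's output:

```python
import numpy as np

# Toy 2-D embeddings standing in for model.encode(...) output
# (real skip-thought vectors are much higher-dimensional).
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

def most_similar(query_idx, emb):
    """Index of the embedding most cosine-similar to emb[query_idx]."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query itself
    return int(np.argmax(sims))

print(most_similar(0, embeddings))  # 1
```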
### Evaluating a trained model
We provide an evaluation script to test the quality of the sentence vectors produced by the trained model.
## Dependencies
All the dependencies are listed in the `requirements.txt` file. They can be installed with `pip` as follows: