Commit

README.md update
emanuelegiona committed Apr 26, 2018
1 parent 23797fc commit 4959cab
Showing 1 changed file: README.md (10 additions, 10 deletions)
@@ -11,9 +11,9 @@ Dictionary built using [Google English One Million 1-grams](http://storage.googl
### Setup:

- Python 3.6.3 (Anaconda custom 64-bit)
-- PyTorch 0.3.1 (0.4.0 might not work due to ["torch.Tensor and autograd.Variable changes"](https://github.com/pytorch/pytorch/releases/tag/v0.4.0))
+- PyTorch 0.3.1 (0.4.0 might not work due to [torch.Tensor and autograd.Variable changes](https://github.com/pytorch/pytorch/releases/tag/v0.4.0))
- CUDA 8
-- spaCy v2.0 with English models (more ["here"](https://spacy.io/usage/))
+- spaCy v2.0 with English models (more [here](https://spacy.io/usage/))
- the project folder must contain a folder named `batches` in the same directory as the `train.py` file
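A possible way to reproduce the environment above (a sketch only: the conda channel and exact package specifiers are assumptions, not taken from this repository):

```shell
# Isolated environment with the Python version listed above
conda create -n word-guesser python=3.6
conda activate word-guesser

# PyTorch 0.3.1 built against CUDA 8 (channel/package names may vary by platform)
conda install pytorch=0.3.1 cuda80 -c pytorch

# spaCy v2.0 with its English models
pip install "spacy>=2.0,<2.1"
python -m spacy download en

# The training script expects a `batches` folder next to train.py
mkdir -p batches
```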

### Training
@@ -24,7 +24,7 @@ Start training by using this command:

where:
- the training set file is a UTF-8 encoded .txt file;
-- the model file is a pre-existent .pt file (by default: word_guesser.pt).
+- the model file is a pre-existing `.pt` file (by default: `word_guesser.pt`).

The model file is not mandatory: if it is not specified, the script assumes no model exists and creates a model file
named `word_guesser.pt`, overwriting it if it already exists. By starting a training run specifying a model file,
@@ -45,22 +45,22 @@ specifying a model file, the model stored in that file will be used for predicti

## Features

-- Multi-threaded operation in order to read from the training file, split to sentences, batching, training simultaneously (producer-consumer pattern)
+- Multi-threaded operation: reading from the training file, sentence splitting, batching, and training all run simultaneously (_producer-consumer pattern_)
- Low RAM usage, thanks to bounded queues between threads and periodic dumps of created batches
- Sentences are never padded: instead, they are grouped by length, and batches are created from sentences of the same length
- Dynamic batch size: batches of maximal size (hyper-parameter `batch_dim`) are created whenever possible, but batches smaller than the chosen size are not padded
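The batching and producer-consumer scheme above can be sketched as follows (a minimal illustration, not the repository's actual code; `batch_dim`, the queue size of 4, and the toy sentences are assumptions):

```python
import threading
import queue
from collections import defaultdict

def make_batches(sentences, batch_dim):
    """Group sentences by length, then cut each group into batches of at
    most `batch_dim` sentences; no padding is ever needed, since every
    sentence in a batch has the same length."""
    by_len = defaultdict(list)
    for sent in sentences:
        by_len[len(sent)].append(sent)
    batches = []
    for group in by_len.values():
        for i in range(0, len(group), batch_dim):
            batches.append(group[i:i + batch_dim])
    return batches

# A bounded queue keeps RAM usage low: the producer blocks as soon as
# `maxsize` batches are waiting to be consumed.
batch_queue = queue.Queue(maxsize=4)

def producer(sentences, batch_dim):
    for batch in make_batches(sentences, batch_dim):
        batch_queue.put(batch)      # blocks while the queue is full
    batch_queue.put(None)           # sentinel: no more batches

def consumer(results):
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        results.append(len(batch))  # a real consumer would train on the batch

sentences = [["a", "b"], ["c", "d"], ["e", "f", "g"], ["h", "i"]]
results = []
t1 = threading.Thread(target=producer, args=(sentences, 2))
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
```

With `batch_dim = 2`, the three length-2 sentences yield one full batch and one smaller (unpadded) batch, and the single length-3 sentence forms its own batch.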

-## Known bugs
+## Known bugs/problems

- Missing `batches` folder creation if it is not already present
- The only accepted training corpus format is UTF-8 encoded plain text
- Slow on large training corpora; might become faster by implementing hierarchical softmax or negative sampling
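Negative sampling, mentioned above as a possible speed-up, replaces the full softmax over the vocabulary with one positive and a few negative binary decisions per update. A minimal sketch of the loss (illustrative only; the vector arguments and pure-Python style are assumptions, not this repository's code):

```python
import math

def neg_sampling_loss(center_vec, true_vec, neg_vecs):
    """Logistic loss for one positive pair plus k sampled negatives,
    avoiding a softmax over the whole vocabulary."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # Push the true context word's score up...
    loss = -math.log(sigmoid(dot(center_vec, true_vec)))
    # ...and the sampled negative words' scores down.
    for nv in neg_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, nv)))
    return loss
```

The cost per update is proportional to the number of negatives (typically 5-20) instead of the vocabulary size.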

## Consulted resources

-- ["PyTorch Tutorials"](http://pytorch.org/tutorials/)
-- ["Practical PyTorch"](https://github.com/spro/practical-pytorch)
-- ["The Incredible PyTorch"](https://github.com/ritchieng/the-incredible-pytorch)
-- ["Optimizing PyTorch training code"](https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/)
-- ["Word Sense Disambiguation with LSTM: Do We Really Need 100 Billion Words?"](https://github.com/cltl/wsd-dynamic-sense-vector)
+- [PyTorch Tutorials](http://pytorch.org/tutorials/)
+- [Practical PyTorch](https://github.com/spro/practical-pytorch)
+- [The Incredible PyTorch](https://github.com/ritchieng/the-incredible-pytorch)
+- [Optimizing PyTorch training code](https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/)
+- [Word Sense Disambiguation with LSTM: Do We Really Need 100 Billion Words?](https://github.com/cltl/wsd-dynamic-sense-vector)
