From 2d453cdd3646b69667edaebd723191f5b0ad3574 Mon Sep 17 00:00:00 2001
From: Adeel Ahmad
Date: Wed, 25 Mar 2020 05:01:19 -0700
Subject: [PATCH] Make minor grammatical improvements (#864)

Summary:
I was following the tutorial on "Word representation" and thought the grammar could use a bit of polishing.
Pull Request resolved: https://github.com/facebookresearch/fastText/pull/864

Reviewed By: EdouardGrave

Differential Revision: D17683908

Pulled By: Celebio

fbshipit-source-id: cc891079a3e089b1730b5c770525bd850412923c
---
 docs/unsupervised-tutorials.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/docs/unsupervised-tutorials.md b/docs/unsupervised-tutorials.md
index 33a6059e6..278c1e1ec 100644
--- a/docs/unsupervised-tutorials.md
+++ b/docs/unsupervised-tutorials.md
@@ -22,7 +22,7 @@ $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
 $ unzip data/enwik9.zip -d data
 ```

-A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/) )
+A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/)).
 ```bash
 $ perl wikifil.pl data/enwik9 > data/fil9
 ```
@@ -147,7 +147,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
 ```


-Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many time will loop over your data. By default, we loop over the dataset 5 times. If you dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate -*lr*). The higher the learning rate is, the faster the model converge to a solution but at the risk of overfitting to the dataset. The default value is 0.05 which is a good compromise. If you want to play with it we suggest to stay in the range of [0.01, 1]:
+Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many times the model will loop over your data. By default, we loop over the dataset 5 times. If you dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate -*lr*. The higher the learning rate is, the faster the model converge to a solution but at the risk of overfitting to the dataset. The default value is 0.05 which is a good compromise. If you want to play with it we suggest to stay in the range of [0.01, 1]:



@@ -180,7 +180,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4

 Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText.

-For examples, we can print the word vectors of words *asparagus,* *pidgey* and *yellow* with the following command:
+For example, we can print the word vectors of words *asparagus,* *pidgey* and *yellow* with the following command:


 ```bash
@@ -226,7 +226,7 @@ $ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin
 ```


-You still get a word vector for it! But how good it is? Let s find out in the next sections!
+You still get a word vector for it! But how good it is? Let's find out in the next sections!


 ## Nearest neighbor queries
@@ -322,7 +322,11 @@ In order to find nearest neighbors, we need to compute a similarity score betwee

 ## Word analogies

-In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France, what Berlin is to Germany.
+In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France, and what Berlin is to Germany.
+
+
+
+
 This can be done with the *analogies* functionality. It takes a word triplet (like *Germany Berlin France*) and outputs the analogy:

 ```bash
@@ -350,7 +354,7 @@ pigneaux 0.736122
 ```


-The answer provides by our model is *Paris*, which is correct. Let's have a look at a less obvious example:
+The answer provided by our model is *Paris*, which is correct. Let's have a look at a less obvious example:



@@ -408,7 +412,7 @@ gearboxes 0.73986

 Most of the retrieved words share substantial substrings but a few are actually quite different, like *cogwheel*. You can try other words like *sunbathe* or *grandnieces*.

-Now that we have seen the interest of subword information for unknown words, let s check how it compares to a model that do not use subword information. To train a model without subwords, just run the following command:
+Now that we have seen the interest of subword information for unknown words, let's check how it compares to a model that does not use subword information. To train a model without subwords, just run the following command:



@@ -423,7 +427,7 @@ The results are saved in result/fil9-non.vec and result/fil9-non.bin.



-To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation* which is a misspelling of *accommodation**.* Here is the nearest neighbors obtained without subwords:



@@ -476,4 +480,4 @@ The nearest neighbors capture different variation around the word *accommodation

 ## Conclusion

-In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and you we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default setting for 294 of them.
+In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default setting for 294 of them.
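
For readers reviewing this patch without the surrounding tutorial, here is a small consolidated sketch of the fastText commands the patched sections walk through. It assumes the `fasttext` binary was built in the current directory and that `data/fil9` and `result/` exist as set up earlier in the tutorial; the specific `-epoch` and `-lr` values shown are illustrative, within the [0.01, 1] learning-rate range the text suggests, not prescribed settings.

```bash
# Train skip-gram vectors with subword n-grams of length 2-5, 300 dimensions,
# and 4 threads (the settings discussed in the patched sections).
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300 -thread 4

# Illustrative alternative run: fewer passes over the data and a higher learning rate.
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5

# Print a vector for any string, even a misspelling, thanks to subword information.
$ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin

# Interactive nearest-neighbor and analogy queries against the trained model.
$ ./fasttext nn result/fil9.bin
$ ./fasttext analogies result/fil9.bin
```

The `nn` subcommand then prompts for a query word, and `analogies` prompts for a word triplet such as `Germany Berlin France`, as described in the patched text.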