This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Commit

Make minor grammatical improvements (#864)
Summary:
I was following the tutorial on "Word representation" and thought the grammar could use a bit of polishing.
Pull Request resolved: #864

Reviewed By: EdouardGrave

Differential Revision: D17683908

Pulled By: Celebio

fbshipit-source-id: cc891079a3e089b1730b5c770525bd850412923c
adl1995 authored and facebook-github-bot committed Mar 25, 2020
1 parent 022c1a7 commit 2d453cd
Showing 1 changed file with 13 additions and 9 deletions.
22 changes: 13 additions & 9 deletions docs/unsupervised-tutorials.md
@@ -22,7 +22,7 @@ $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
```

A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/) )
A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/)).

```bash
$ perl wikifil.pl data/enwik9 > data/fil9
@@ -147,7 +147,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim
```
<!--END_DOCUSAURUS_CODE_TABS-->

Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many time will loop over your data. By default, we loop over the dataset 5 times. If you dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate -*lr*). The higher the learning rate is, the faster the model converge to a solution but at the risk of overfitting to the dataset. The default value is 0.05 which is a good compromise. If you want to play with it we suggest to stay in the range of [0.01, 1]:
Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many times the model will loop over your data. By default, we loop over the dataset 5 times. If your dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate (-*lr*). The higher the learning rate, the faster the model converges to a solution, but at the risk of overfitting to the dataset. The default value is 0.05, which is a good compromise. If you want to play with it, we suggest staying in the range of [0.01, 1]:
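
For instance, a minimal sketch of such a command (the `-epoch` and `-lr` values below are only illustrative, and the paths follow the ones used earlier in this tutorial):

```bash
# Illustrative values only: a single epoch with a higher learning rate
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
```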

<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -180,7 +180,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4

Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText.

For examples, we can print the word vectors of words *asparagus,* *pidgey* and *yellow* with the following command:
For example, we can print the word vectors of the words *asparagus*, *pidgey* and *yellow* with the following command:
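
A minimal sketch of that invocation (this assumes `print-word-vectors` reads its query words from standard input, and reuses the `result/fil9.bin` model trained above):

```bash
# Query vectors for three words by piping them to print-word-vectors
$ echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
```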
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
```bash
@@ -226,7 +226,7 @@ $ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin
```
<!--END_DOCUSAURUS_CODE_TABS-->

You still get a word vector for it! But how good it is? Let s find out in the next sections!
You still get a word vector for it! But how good is it? Let's find out in the next sections!


## Nearest neighbor queries
@@ -322,7 +322,7 @@ In order to find nearest neighbors, we need to compute a similarity score betwee
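
As a rough sketch, such queries can be run interactively with the `nn` subcommand of the fastText CLI (the subcommand name and its interactive behaviour are assumptions based on standard fastText usage; they are not shown in the excerpt above):

```bash
# Interactive nearest-neighbor queries against the trained model
$ ./fasttext nn result/fil9.bin
# then type a query word at the prompt, e.g. asparagus
```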

## Word analogies

In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France, what Berlin is to Germany.
In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France what Berlin is to Germany.

This can be done with the *analogies* functionality. It takes a word triplet (like *Germany Berlin France*) and outputs the analogy:
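
A minimal sketch of such a query, assuming the *analogies* functionality is invoked interactively on the model trained earlier (the exact prompt behaviour may differ):

```bash
# Interactive analogy queries against the trained model
$ ./fasttext analogies result/fil9.bin
# at the prompt, type the triplet from the text above, e.g. germany berlin france
```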

@@ -350,7 +354,7 @@ pigneaux 0.736122
```
<!--END_DOCUSAURUS_CODE_TABS-->

The answer provides by our model is *Paris*, which is correct. Let's have a look at a less obvious example:
The answer provided by our model is *Paris*, which is correct. Let's have a look at a less obvious example:

<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -408,7 +412,7 @@ gearboxes 0.73986

Most of the retrieved words share substantial substrings but a few are actually quite different, like *cogwheel*. You can try other words like *sunbathe* or *grandnieces*.

Now that we have seen the interest of subword information for unknown words, let s check how it compares to a model that do not use subword information. To train a model without subwords, just run the following command:
Now that we have seen the interest of subword information for unknown words, let's check how it compares to a model that does not use subword information. To train a model without subwords, just run the following command:
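
A minimal sketch of what that command could look like, assuming `-maxn 0` is one way to disable subword n-grams, with the output name matching the `result/fil9-non` files mentioned just below:

```bash
# Train skipgram vectors without character n-grams (no subword information)
$ ./fasttext skipgram -input data/fil9 -output result/fil9-non -maxn 0
```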

<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -423,7 +427,7 @@ The results are saved in result/fil9-non.vec and result/fil9-non.bin.
<!--END_DOCUSAURUS_CODE_TABS-->


To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation* which is a misspelling of *accommodation*. Here is the nearest neighbors obtained without subwords:
To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation*, which is a misspelling of *accommodation*. Here are the nearest neighbors obtained without subwords:
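
One way to reproduce this query, sketched under the assumption that the `nn` subcommand accepts query words piped on standard input:

```bash
# Nearest neighbors of the misspelled word, using the model trained without subwords
$ echo "accomodation" | ./fasttext nn result/fil9-non.bin
```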

<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -476,4 +480,4 @@ The nearest neighbors capture different variation around the word *accommodation*

## Conclusion

In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and you we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default setting for 294 of them.
In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language, and we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default settings for 294 of them.
