- Word Embeddings or Word vectorization
- What is Word Embedding?
- Word Embedding properties
- Mathematical Representation
- Types of Word Vectorization or Word Embeddings
- Word Embedding or word vectorization is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are used for word prediction and for measuring word similarity/semantics.
- Word embedding is just a fancy name for a feature vector that represents a word.
- When we take a categorical object (a word in this case) and map it to a list of numbers, in other words a vector, we say we have embedded this word into a vector space.
- So that's why we call them word embeddings.
- The operation (text to number), or vectorization, is done either at the "word" level or at the "character" level.
- The process of converting words into numbers is called vectorization.
- This is one of the most important advances in Deep NLP research.
- Word embeddings allow you to map words into a vector space.
- Once you can represent something as a vector, you can perform arithmetic on it.
- So, this is where the famous phrases come from.
- king - man = queen - woman
- December - November = July - June
- France - Paris = England - London
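- As a rough sketch (not part of the original notes), this kind of vector arithmetic can be tried with gensim and a pretrained model; the model name below is only an illustrative choice from `gensim.downloader`.

```python
# A minimal sketch of word-vector arithmetic, assuming gensim is installed and
# a pretrained model is available via gensim.downloader (model name illustrative).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # returns KeyedVectors

# king - man + woman ≈ queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # "queen" is expected to rank near the top
```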
- Mathematically, an embedding represents a mapping f: X → Y, which is a function.
- The function is:
- injective (what we call an injective function: each Y has a unique X correspondence, and vice versa)
- structure-preserving (for example, if X1 < X2 in the space X belongs to, then Y1 < Y2 holds in the space Y belongs to after the mapping)
- So for word embedding, each word is mapped to another space, and this mapping is injective and structure-preserving.
- A neural network cannot be trained directly on raw text data.
- We need to process the text data into numerical tensors first.
- This process is also called text vectorization.
- There are several strategies for text vectorization:
- Split text into words, each word is converted into a vector
- Split text into characters, each character is converted into a vector
- Extract n-grams of words or characters, and convert each n-gram into a vector (see the sketch below)
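- A small sketch (not from the original notes) of these three splitting strategies on one sentence; the character n-gram size of 3 is an arbitrary choice.

```python
# A minimal sketch of the three splitting strategies used before vectorization.
text = "the boy is crying"

words = text.split()                                         # word-level tokens
chars = list(text.replace(" ", ""))                          # character-level tokens
char_ngrams = [text[i:i + 3] for i in range(len(text) - 2)]  # character 3-grams

print(words)            # ['the', 'boy', 'is', 'crying']
print(chars[:5])        # ['t', 'h', 'e', 'b', 'o']
print(char_ngrams[:5])  # ['the', 'he ', 'e b', ' bo', 'boy']
```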
- There are two main methods for word vectorization:
- One-hot Encoding
- Word Embedding
- Why is it called one-hot?
- After each word is one-hot encoded, only one position has an element of 1 and the other positions are all 0.
- Example: "the boy is crying" (assuming there are only four English words in the world), after one-hot encoding,
- the corresponds to (1, 0, 0, 0)
- boy corresponds to (0, 1, 0, 0)
- is corresponds to (0, 0, 1, 0)
- crying corresponds to (0, 0, 0, 1)
- Each word corresponds to a position in the vector, and this position represents the word.
- This approach requires a very high dimension: if the vocabulary has 100,000 words, then each word needs to be represented by a vector of length 100,000.
- the corresponds to (1, 0, 0, 0, ..., 0) (length is 100,000)
- And so on, yielding high-dimensional sparse tensors.
- Another main drawback is that one-hot word vectors cannot accurately express the similarity between different words, such as the cosine similarity that we often use.
- Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot vectors cannot encode similarities among words.
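- A minimal sketch (not from the original notes) that makes this concrete: the cosine similarity between the one-hot vectors of any two distinct words is 0.

```python
# One-hot vectors for a toy 4-word vocabulary, and their cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["the", "boy", "is", "crying"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["boy"])                                    # [0. 1. 0. 0.]
print(cosine_similarity(one_hot["boy"], one_hot["is"]))  # 0.0 for any two distinct words
```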
- In contrast, Word Embedding embeds words into a low-dimensional dense space.
- Example: the same sentence "the boy is crying" (again assuming there are only four English words in the world), after embedding, may become:
- the corresponds to (0.1, 0.2, 0.4, 0, ...)
- boy corresponds to (0.23, 0.14, 0, 0, ...)
- is corresponds to (0, 0, 0.41, 0.9, ...)
- crying corresponds to (0.82, 0, 0.14, ...)
- We assume that the embedded space is 256-dimensional (generally 256, 512, or 1024 dimensions; the larger the vocabulary, the higher the corresponding spatial dimension), so each of the vectors above has length 256.
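- As a sketch (illustrative values, not the real training procedure): a word embedding is just a row lookup in a dense, low-dimensional matrix, instead of a huge sparse one-hot vector.

```python
# A minimal embedding-lookup sketch; the matrix values are random here,
# whereas in practice they are learned during training.
import numpy as np

vocab = {"the": 0, "boy": 1, "is": 2, "crying": 3}
embedding_dim = 8  # in practice 256, 512, or 1024
rng = np.random.default_rng(0)

# One dense row per word in the vocabulary.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(word: str) -> np.ndarray:
    """Look up the dense vector for a word."""
    return embedding_matrix[vocab[word]]

print(embed("boy"))        # a dense vector of length 8
print(embed("boy").shape)  # (8,)
```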
- In practice, the word "boy" should be very close to the word "man" (because they are closely related), and the word "cat" should be very far from the word "stone" (because they are basically unrelated).
- The embedding space has low dimensionality and allows the space to have structure.
- For example, the distances between vectors can reflect gender, age, etc. (this requires training; an untrained embedding layer has no structure), as in:
- man-woman = boy-girl
- man-daddy = woman-mother
- The two most popular algorithms for learning word embeddings are:
- Word2vec
- GloVe
- Word2Vec is a shallow, two-layer neural network that is trained to reconstruct linguistic contexts of words.
- It takes a large corpus of words as its input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
- Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
- Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
- It comes in two flavors, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
- Algorithmically, these models are similar.
- Because we do not know the bias terms, it is very difficult to identify the right hyperparameters.
- The learning rate is very important.
- The word2vec tool contains two models, namely:
- Skip-gram [Mikolov et al.]: we predict the context words from the target word.
- Continuous bag of words (CBOW) [Mikolov et al.]: we predict the target word from the context.
- For semantically meaningful representations, their training relies on conditional probabilities that can be viewed as predicting some words using some of their surrounding words in corpora.
- Since supervision comes from the data without labels, both skip-gram and continuous bag of words are self-supervised models.
- The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence.
- For example, given the text sequence "the", "man", "loves", "his", and "son", let us choose "loves" as the center word and set the context window size to 2.
- Given the center word "loves", the skip-gram model considers the conditional probability of generating the context words "the", "man", "his", and "son", which are no more than 2 words away from the center word: P("the", "man", "his", "son" | "loves").
- Assume that the context words are independently generated given the center word (i.e., conditional independence).
- In this case, the above conditional probability can be rewritten as P("the" | "loves") · P("man" | "loves") · P("his" | "loves") · P("son" | "loves").
- CBOW model is similar to the skip-gram model.
- The major difference from the skip-gram model is that the continuous bag of words model assumes that a center word is generated based on its surrounding context words in the text sequence.
- For example, in the same text sequence “the”, “man”, “loves”, “his”, and “son”, with “loves” as the center word and the context window size being 2,
- The continuous bag of words model considers the conditional probability of generating the center word "loves" based on the context words "the", "man", "his", and "son": P("loves" | "the", "man", "his", "son").
- The "Window size" or "The token of interest" is generally 4.
- For a sentence -“the”, “man”, “loves”, “his”, and “son”.
- It will generate like this
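- A minimal sketch (not from the original notes) of how skip-gram (center, context) pairs are generated from this sentence with a window size of 2.

```python
# Generate skip-gram (center, context) training pairs with a given window size.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "man", "loves", "his", "son"]
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. loves -> the, loves -> man, loves -> his, loves -> son, ...
```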
- Choice of model architecture (CBOW vs. skip-gram):
- Large corpus, higher dimensions, slower training: skip-gram
- Small corpus, faster training: CBOW
- Ways to improve embedding quality (see the training sketch below):
- Increasing the size of the training dataset.
- Increasing the vector dimensions.
- Increasing the window size.
- A "small" dataset here is roughly < 30 MB.
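- The architecture choice (skip-gram vs. CBOW) and the hyperparameters above can be experimented with in gensim; this is only a sketch with illustrative values, assuming gensim 4.x.

```python
# A minimal word2vec training sketch, assuming gensim 4.x; the tiny corpus and
# parameter values are illustrative only.
from gensim.models import Word2Vec

sentences = [["the", "man", "loves", "his", "son"],
             ["the", "boy", "is", "crying"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
)

print(model.wv["boy"].shape)              # (100,)
print(model.wv.similarity("boy", "man"))  # cosine similarity of the learned vectors
```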
- Negative sampling reduces computation by sampling just N negative instances along with the target word instead of sampling the whole vocabulary.
- Technically, negative sampling ignores most of the ‘0’ in the one-hot label word vector, and only propagates and updates the weights for the target and a few negative classes which were randomly sampled.
- More concretely, negative sampling samples negative instances(words) along with the target word and minimizes the log-likelihood of the sampled negative instances while maximizing the log-likelihood of the target word.
#### How are the negative samples chosen?
- The negative samples are chosen using a unigram distribution.
- Essentially, the probability of selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
- Specifically, each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of the weights of all words: P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4).
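- A minimal sketch (with illustrative word counts) of the unigram^(3/4) sampling distribution described above.

```python
# Negative-sampling probabilities: weight = count ** 0.75, normalized over the vocab.
import numpy as np

word_counts = {"the": 1000, "boy": 50, "is": 800, "crying": 10}

words = list(word_counts)
weights = np.array([word_counts[w] ** 0.75 for w in words])
probs = weights / weights.sum()

for w, p in zip(words, probs):
    print(f"{w}: {p:.3f}")  # frequent words are picked more often, but less
                            # extremely than under the raw unigram distribution

# Draw 5 negative samples from this distribution:
negatives = np.random.choice(words, size=5, p=probs)
print(negatives)
```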
- Some frequent words often provide little information. Words with frequency above a certain threshold (e.g., "a", "an", and "that") may be subsampled to increase training speed and performance.
- Also, common word pairs or phrases may be treated as single "words" to increase training speed.
- The size of the context window determines how many words before and after a given word would be included as context words of the given word. According to the authors’ note, the recommended value is 10 for skip-gram and 5 for CBOW.
- Here is an example of Skip-Gram with context window of size 2:
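- For the sentence "the man loves his son", a context window of size 2 gives the following (center, context) pairs:
- center "the": "man", "loves"
- center "man": "the", "loves", "his"
- center "loves": "the", "man", "his", "son"
- center "his": "man", "loves", "son"
- center "son": "loves", "his"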