- Word Embeddings or Word vectorization
- What is Word Embedding?
- Word Embedding properties
- Mathematical Representation
- Types of Word Vectorization or Word Embeddings
- Word Embedding or word vectorization is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are used for word prediction and for measuring word similarity/semantics.
- Word embedding is just a fancy name for a feature vector that represents a word.
- When we take a categorical object (a word in this case) and map it to a list of numbers, in other words a vector, we say we have embedded this word into a vector space.
- So that's why we call them word embeddings.
- The operation (text to number), or vectorization, is done either at the "word" level or at the "character" level.
- The process of converting words into numbers is called vectorization.
- This is one of the most important advances in Deep NLP research.
- Word embeddings allow you to map words into a vector space.
- Once you can represent something as a vector, you can perform arithmetic on it.
- So, this is where the famous phrases come from.
- king - man = queen - woman
- December - November = July - June
- France - Paris = England - London
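- As a rough sketch (not part of the original notes), this kind of vector arithmetic can be tried with gensim and a pretrained model; the model name below is only an illustrative choice from `gensim.downloader`.

```python
# A minimal sketch of word-vector arithmetic, assuming gensim is installed and
# a pretrained model is available via gensim.downloader (model name illustrative).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # returns KeyedVectors

# king - man + woman ≈ queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # "queen" is expected to rank near the top
```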
- Mathematically, an embedding represents a mapping f: X → Y, which is a function.
- The function is:
- injective (what we call an injective function: each Y has a unique X correspondence, and vice versa)
- structure-preserving (for example, if X1 < X2 in the space X belongs to, then Y1 < Y2 holds in the space Y belongs to after the mapping)
- So for word embedding, each word is mapped to another space, and this mapping is injective and structure-preserving.
- A neural network cannot be trained directly on raw text data.
- We need to process the text data into numerical tensors first.
- This process is also called text vectorization.
- There are several strategies for text vectorization:
- Split text into words, each word is converted into a vector
- Split text into characters, each character is converted into a vector
- Extract n-grams of words or characters, and convert each n-gram into a vector (see the sketch below)
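- A small sketch (not from the original notes) of these three splitting strategies on one sentence; the character n-gram size of 3 is an arbitrary choice.

```python
# A minimal sketch of the three splitting strategies used before vectorization.
text = "the boy is crying"

words = text.split()                                         # word-level tokens
chars = list(text.replace(" ", ""))                          # character-level tokens
char_ngrams = [text[i:i + 3] for i in range(len(text) - 2)]  # character 3-grams

print(words)            # ['the', 'boy', 'is', 'crying']
print(chars[:5])        # ['t', 'h', 'e', 'b', 'o']
print(char_ngrams[:5])  # ['the', 'he ', 'e b', ' bo', 'boy']
```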
- There are two main methods for word vectorization:
- One-hot Encoding
- Word Embedding
- Why is it called one-hot?
- After each word is one-hot encoded, only one position has an element of 1 and the other positions are all 0.
- Example: "the boy is crying" (assuming there are only four English words in the world), after one-hot encoding,
- the corresponds to (1, 0, 0, 0)
- boy corresponds to (0, 1, 0, 0)
- is corresponds to (0, 0, 1, 0)
- crying corresponds to (0, 0, 0, 1)
- Each word corresponds to a position in the vector, and this position represents the word.
- This approach requires a very high dimension: if the vocabulary has 100,000 words, then each word needs to be represented by a vector of length 100,000.
- the corresponds to (1, 0, 0, 0, ..., 0) (length is 100,000)
- And so on, yielding high-dimensional sparse tensors.
- Another main drawback is that one-hot word vectors cannot accurately express the similarity between different words, such as the cosine similarity that we often use.
- Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot vectors cannot encode similarities among words.
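- A minimal sketch (not from the original notes) that makes this concrete: the cosine similarity between the one-hot vectors of any two distinct words is 0.

```python
# One-hot vectors for a toy 4-word vocabulary, and their cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["the", "boy", "is", "crying"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["boy"])                                    # [0. 1. 0. 0.]
print(cosine_similarity(one_hot["boy"], one_hot["is"]))  # 0.0 for any two distinct words
```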
- In contrast, Word Embedding embeds words into a low-dimensional dense space.
- Example: the same sentence "the boy is crying" (again assuming there are only four English words in the world), after embedding, may become:
- the corresponds to (0.1, 0.2, 0.4, 0, ...)
- boy corresponds to (0.23, 0.14, 0, 0, ...)
- is corresponds to (0, 0, 0.41, 0.9, ...)
- crying corresponds to (0.82, 0, 0.14, ...)
- We assume that the embedded space is 256-dimensional (generally 256, 512, or 1024 dimensions; the larger the vocabulary, the higher the corresponding spatial dimension), so each of the vectors above has length 256.
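- As a sketch (illustrative values, not the real training procedure): a word embedding is just a row lookup in a dense, low-dimensional matrix, instead of a huge sparse one-hot vector.

```python
# A minimal embedding-lookup sketch; the matrix values are random here,
# whereas in practice they are learned during training.
import numpy as np

vocab = {"the": 0, "boy": 1, "is": 2, "crying": 3}
embedding_dim = 8  # in practice 256, 512, or 1024
rng = np.random.default_rng(0)

# One dense row per word in the vocabulary.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(word: str) -> np.ndarray:
    """Look up the dense vector for a word."""
    return embedding_matrix[vocab[word]]

print(embed("boy"))        # a dense vector of length 8
print(embed("boy").shape)  # (8,)
```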
- In practice, the word "boy" should be very close to the word "man" (because they are closely related), and the word "cat" should be very far from the word "stone" (because they are basically unrelated).
- The embedding space has low dimensionality and allows the space to have structure.
- For example, the distances between vectors can reflect gender, age, etc. (this requires training; an untrained embedding layer has no structure), as in:
- man-woman = boy-girl
- man-daddy = woman-mother
- The two most popular algorithms for learning word embeddings are:
- Word2vec
- GloVe
- Word2Vec is a shallow, two-layer neural network that is trained to reconstruct linguistic contexts of words.
- It takes a large corpus of words as its input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
- Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
- Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
- It comes in two flavors, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
- Algorithmically, these models are similar.
- Because we do not know the bias terms, it is very difficult to identify the right hyperparameters.
- The learning rate is very important.
- The word2vec tool contains two models, namely:
- Skip-gram [Mikolov et al.]: we predict the context words from the target word.
- Continuous bag of words (CBOW) [Mikolov et al.]: we predict the target word from the context.
- For semantically meaningful representations, their training relies on conditional probabilities that can be viewed as predicting some words using some of their surrounding words in corpora.
- Since supervision comes from the data without labels, both skip-gram and continuous bag of words are self-supervised models.
- The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence.
- For example, given the text sequence "the", "man", "loves", "his", and "son", let us choose "loves" as the center word and set the context window size to 2.
- Given the center word "loves", the skip-gram model considers the conditional probability of generating the context words "the", "man", "his", and "son", which are no more than 2 words away from the center word: P("the", "man", "his", "son" | "loves").
- Assume that the context words are independently generated given the center word (i.e., conditional independence).
- In this case, the above conditional probability can be rewritten as P("the" | "loves") · P("man" | "loves") · P("his" | "loves") · P("son" | "loves").
- CBOW model is similar to the skip-gram model.
- The major difference from the skip-gram model is that the continuous bag of words model assumes that a center word is generated based on its surrounding context words in the text sequence.
- For example, in the same text sequence “the”, “man”, “loves”, “his”, and “son”, with “loves” as the center word and the context window size being 2,
- The continuous bag of words model considers the conditional probability of generating the center word "loves" based on the context words "the", "man", "his", and "son": P("loves" | "the", "man", "his", "son").
- The "Window size" or "The token of interest" is generally 4.
- For a sentence -“the”, “man”, “loves”, “his”, and “son”.
- It will generate like this
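- A minimal sketch (not from the original notes) of how skip-gram (center, context) pairs are generated from this sentence with a window size of 2.

```python
# Generate skip-gram (center, context) training pairs with a given window size.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "man", "loves", "his", "son"]
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. loves -> the, loves -> man, loves -> his, loves -> son, ...
```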
- Choice of model architecture (CBOW vs. skip-gram):
- Large corpus, higher dimensions, slower training: skip-gram
- Small corpus, faster training: CBOW
- Ways to improve embedding quality (see the training sketch below):
- Increasing the size of the training dataset.
- Increasing the vector dimensions.
- Increasing the window size.
- A "small" dataset here is roughly < 30 MB.
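- The architecture choice (skip-gram vs. CBOW) and the hyperparameters above can be experimented with in gensim; this is only a sketch with illustrative values, assuming gensim 4.x.

```python
# A minimal word2vec training sketch, assuming gensim 4.x; the tiny corpus and
# parameter values are illustrative only.
from gensim.models import Word2Vec

sentences = [["the", "man", "loves", "his", "son"],
             ["the", "boy", "is", "crying"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
)

print(model.wv["boy"].shape)              # (100,)
print(model.wv.similarity("boy", "man"))  # cosine similarity of the learned vectors
```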
- Negative sampling reduces computation by sampling just N negative instances along with the target word instead of sampling the whole vocabulary.
- Technically, negative sampling ignores most of the ‘0’ in the one-hot label word vector, and only propagates and updates the weights for the target and a few negative classes which were randomly sampled.
- More concretely, negative sampling samples negative instances(words) along with the target word and minimizes the log-likelihood of the sampled negative instances while maximizing the log-likelihood of the target word.
#### How are the negative samples chosen?
- The negative samples are chosen using a unigram distribution.
- Essentially, the probability of selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
- Specifically, each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of the weights of all words: P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4).
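- A minimal sketch (with illustrative word counts) of the unigram^(3/4) sampling distribution described above.

```python
# Negative-sampling probabilities: weight = count ** 0.75, normalized over the vocab.
import numpy as np

word_counts = {"the": 1000, "boy": 50, "is": 800, "crying": 10}

words = list(word_counts)
weights = np.array([word_counts[w] ** 0.75 for w in words])
probs = weights / weights.sum()

for w, p in zip(words, probs):
    print(f"{w}: {p:.3f}")  # frequent words are picked more often, but less
                            # extremely than under the raw unigram distribution

# Draw 5 negative samples from this distribution:
negatives = np.random.choice(words, size=5, p=probs)
print(negatives)
```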
- Some frequent words often provide little information. Words with frequency above a certain threshold (e.g., "a", "an", and "that") may be subsampled to increase training speed and performance.
- Also, common word pairs or phrases may be treated as single "words" to increase training speed.
- The size of the context window determines how many words before and after a given word would be included as context words of the given word. According to the authors’ note, the recommended value is 10 for skip-gram and 5 for CBOW.
- Here is an example of Skip-Gram with context window of size 2:
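- For the sentence "the man loves his son", a context window of size 2 gives the following (center, context) pairs:
- center "the": "man", "loves"
- center "man": "the", "loves", "his"
- center "loves": "the", "man", "his", "son"
- center "his": "man", "loves", "son"
- center "son": "loves", "his"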