\textbf{Goal:}\\
The paper implements a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by prior knowledge extracted from the WordNet semantic hierarchy. \cite{morin2005hierarchical}\\
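As a rough sketch of where the saving comes from (the notation here is generic, not the paper's exact formulation): instead of normalizing over the whole vocabulary $V$, the probability of a word is factored into the binary decisions along its path in the hierarchy,
$$P(w \mid \mathrm{context}) = \prod_{l=1}^{L(w)} P\big(b_l(w) \mid b_1(w), \ldots, b_{l-1}(w), \mathrm{context}\big),$$
so each prediction requires only on the order of $\log_2 |V|$ binary decisions rather than a normalization over all $|V|$ words.\\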
\textbf{Goal:}\\
The paper aims to address the inaccuracy of vector representations for complex and rare words, which the authors attribute to existing models ignoring the relationships between morphologically related words. \cite{luong2013better}\\
\textbf{Approach:}
\begin{itemize}
\section{Efficient Estimation of Word Representations in Vector Space} % (fold)
\textbf{Goal:}\\
The main goal of this paper is to introduce techniques for learning high-quality word vectors from huge data sets with billions of words and vocabularies of millions of words. This is one of the seminal papers that led to the creation of Word2Vec, which is a state-of-the-art word embedding tool. \cite{mikolov2013efficient}\\
\textbf{Approach:}
\begin{itemize}
\item
To allow for distributed training, the DistBelief framework was used with multiple replicas of the model, and Adagrad was utilized for asynchronous gradient descent.
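For reference, the per-coordinate Adagrad update being referred to has the general form (the learning rate $\eta$ and the small constant $\epsilon$ are generic placeholders, not values reported in the paper):
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^{2}} + \epsilon}\, g_{t,i},$$
so coordinates that accumulate large gradients automatically receive smaller steps, which makes the method well suited to asynchronous updates across model replicas.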
\item
Two distinct models were conceptualized for training the word vectors based on context, both of which are continuous, distributed representations of words. These are illustrated in Figure
Continuous Bag-of-Words (CBOW) model: this model uses the context of a word, i.e., the words that precede and follow it, to predict the current word.
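As a toy illustration of the direction of prediction in CBOW (the function below and its window size are illustrative, not the paper's implementation):
\begin{verbatim}
def cbow_pairs(tokens, window=2):
    # CBOW pairs each centre word with the context words around it:
    # the context is the input, the centre word is the prediction target.
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            pairs.append((context, target))
    return pairs

# cbow_pairs("the quick brown fox".split(), window=1)
# -> [(['quick'], 'the'), (['the', 'brown'], 'quick'), ...]
\end{verbatim}
The Skip-gram model reverses this direction and uses the current word to predict the words in its context.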
\section{Distributed Representations of Words and Phrases and their Compositionality} % (fold)
\textbf{Goal:}\\
This paper builds upon the Word2Vec skip-gram model and presents optimizations that improve both the quality of the word embeddings and the training speed. It also proposes an alternative to the hierarchical softmax output layer, called negative sampling.\\
\textbf{Approach:}
\begin{itemize}
\item
One of the suggested optimizations is to subsample the most frequent words in the training set, which yields a significant speed-up in training.
\item
Given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, the objective of the Skip-gram model is to maximize the average log probability
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t),$$
where $c$ is the size of the window (context) surrounding the current word being trained on.
\item
As introduced by Morin and Bengio \cite{morin2005hierarchical}, the hierarchical softmax is a computationally efficient approximation of the full softmax. It uses a binary tree representation of the output layer with the $W$ words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.
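A minimal sketch of how a word's probability is read off such a tree, assuming the word's path (the vectors of the internal nodes visited and the recorded left/right turns, encoded as $\pm 1$) has already been looked up; all names here are illustrative:
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(input_vec, path_node_vecs, turn_signs):
    # Multiply, over the internal nodes on the word's path, the
    # probability of taking the recorded turn (+1 or -1) at that node.
    prob = 1.0
    for node_vec, sign in zip(path_node_vecs, turn_signs):
        prob *= sigmoid(sign * np.dot(node_vec, input_vec))
    return prob
\end{verbatim}
Because the path length is roughly $\log_2 W$, this costs far less than evaluating a full softmax over all $W$ words.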
\item
The authors use a binary Huffman tree, as it assigns short codes to the frequent words, which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speed-up technique for neural-network-based language models.
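A small sketch of how code lengths fall out of the Huffman construction over word frequencies (the toy counts are made up; this is not the authors' code):
\begin{verbatim}
import heapq

def huffman_code_lengths(freqs):
    # freqs: word -> count. Returns word -> depth in the Huffman tree,
    # i.e. the length of that word's binary code.
    heap = [(count, [word]) for word, count in freqs.items()]
    heapq.heapify(heap)
    depth = {word: 0 for word in freqs}
    while len(heap) > 1:
        c1, words1 = heapq.heappop(heap)
        c2, words2 = heapq.heappop(heap)
        for word in words1 + words2:
            depth[word] += 1
        heapq.heappush(heap, (c1 + c2, words1 + words2))
    return depth

# huffman_code_lengths({"the": 1000, "cat": 50, "sat": 40, "zymurgy": 1})
# gives "the" the shortest code and "zymurgy" the longest.
\end{verbatim}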
\item
Noise Contrastive Estimation (NCE), an alternative to the hierarchical softmax, posits that a good model should be able to differentiate data from noise by means of logistic regression. The paper simplifies NCE into Negative Sampling (NEG), which only requires samples from a noise distribution rather than its numerical probabilities.
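Concretely, for a training pair $(w_I, w_O)$ the negative-sampling objective that replaces $\log p(w_O \mid w_I)$ takes the form
$$\log \sigma\big({v'_{w_O}}^{\top} v_{w_I}\big) + \sum_{i=1}^{k} \log \sigma\big(-{v'_{w_i}}^{\top} v_{w_I}\big), \qquad w_i \sim P_n(w),$$
where the $k$ negative words $w_i$ are drawn from a noise distribution $P_n(w)$ (the unigram distribution raised to the $3/4$ power in the paper).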
\item
To counter the imbalance between rare and frequent words, the authors use a simple subsampling approach: each word $w_i$ in the training set is discarded with the probability given by the formula below.
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}},$$
where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold, typically around $10^{-5}$.
This is similar to dropping out neurons from a network, except that with this method frequent words are statistically much more likely to be removed from the corpus.
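As a quick illustration of how aggressive this is for very frequent words: with the threshold $t = 10^{-5}$ suggested in the paper and a word that accounts for $5\%$ of all tokens, $P(w_i) = 1 - \sqrt{10^{-5}/0.05} \approx 0.99$, so almost every occurrence of that word is discarded, while words rarer than $t$ are always kept.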
\item
Discarding the frequently occurring words allows for a reduction in computational and memory cost.
\item
The individual words can easily be coalesced into phrases using unigram and bigram frequency counts, as shown below:
$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)},$$
where $\delta$ is a discounting coefficient that prevents phrases of very infrequent words from being formed; bigrams whose score exceeds a chosen threshold are then treated as single tokens.
Another interesting property of learning these distributed representations is that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic.
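As a toy illustration of that vector arithmetic (the helper below assumes a plain dictionary of NumPy word vectors and is not the authors' tooling):
\begin{verbatim}
import numpy as np

def analogy(embeddings, a, b, c):
    # "a is to b as c is to ?": return the word whose vector is closest
    # (by cosine similarity) to vec(b) - vec(a) + vec(c), excluding the
    # three query words themselves.
    query = embeddings[b] - embeddings[a] + embeddings[c]
    query = query / np.linalg.norm(query)
    best_word, best_sim = None, -2.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(query, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With well-trained vectors, analogy(emb, "Spain", "Madrid", "France")
# is expected to return "Paris".
\end{verbatim}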