Commit 7578616

Author: Vineet John
Added notes for Distributed Representations of Words and Phrases and their Compositionality
1 parent 4a3096f

File tree: 2 files changed (+60, -11 lines)


project-report/cbow-skipgram.png

-27.4 KB

project-report/cs698_project_report.tex

Lines changed: 60 additions & 11 deletions
@@ -13,7 +13,7 @@
 
 \setlength\titlebox{5cm}
 
-\title{A Survey of Neural Network Techniques for Feature Extraction from Text}
+\title{A Survey of Neural Network Techniques\\for Feature Extraction from Text}
 
 \author{
 Vineet John \\
@@ -214,8 +214,8 @@ \section{A Hierarchical Neural Autoencoder for Paragraphs and Documents} % (fold
 \section{Hierarchical Probabilistic Neural Network Language Model} % (fold)
 \label{sec:hierarchical_probabilistic_neural_network_language_model}
 
-\textbf{Goal}
-Implementing a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy
+\textbf{Goal:}\\
+Implementing a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200, during both training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by prior knowledge extracted from the WordNet semantic hierarchy.\\
 
 \textbf{Summary:}
 \begin{itemize}
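As a rough editorial illustration of where a speed-up of this kind comes from: a flat softmax scores every vocabulary word per prediction, while a binary hierarchical decomposition only evaluates the binary decisions along one root-to-leaf path. The sketch below assumes a perfectly balanced binary tree and an illustrative vocabulary size (neither is taken from the paper); the reported factor of about 200 is smaller than this naive ratio, presumably because the output layer is only part of the total cost and the WordNet-derived tree is not balanced.

import math

def flat_softmax_units(vocab_size: int) -> int:
    # A flat softmax output layer scores every word in the vocabulary.
    return vocab_size

def hierarchical_units(vocab_size: int) -> int:
    # A balanced binary tree needs one binary decision per tree level.
    return math.ceil(math.log2(vocab_size))

V = 10_000  # illustrative vocabulary size, not a figure from the paper
print(flat_softmax_units(V))                          # 10000 scores per prediction
print(hierarchical_units(V))                          # 14 binary decisions per prediction
print(flat_softmax_units(V) / hierarchical_units(V))  # rough upper bound on the output-layer speed-up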
@@ -237,8 +237,8 @@ \section{Hierarchical Probabilistic Neural Network Language Model} % (fold)
 \section{Better Word Representations with Recursive Neural Networks for Morphology} % (fold)
 \label{sec:better_word_representations_with_recursive_neural_networks_for_morphology}
 
-\textbf{Goal:}
-The paper aims to address the inaccuracy in vector representations of complex and rare words, supposedly caused by the lack of relation between morphologically related words. \cite{luong2013better}
+\textbf{Goal:}\\
+The paper aims to address the inaccuracy in vector representations of complex and rare words, supposedly caused by the lack of relation between morphologically related words. \cite{luong2013better}\\
 
 \textbf{Approach:}
 \begin{itemize}
@@ -279,8 +279,8 @@ \section{Better Word Representations with Recursive Neural Networks for Morpholo
 \section{Efficient Estimation of Word Representations in Vector Space} % (fold)
 \label{sec:efficient_estimation_of_word_representations_in_vector_space}
 
-\textbf{Goal:}
-The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. This is one of the seminal papers that led to the creation of Word2Vec, which is a state-of-the-art word embeddding tool. \cite{mikolov2013efficient}
+\textbf{Goal:}\\
+The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. This is one of the seminal papers that led to the creation of Word2Vec, a state-of-the-art word embedding tool. \cite{mikolov2013efficient}\\
 
 \textbf{Approach:}
 \begin{itemize}
@@ -300,7 +300,13 @@ \section{Efficient Estimation of Word Representations in Vector Space} % (fold)
 \item
 To allow for the distributed training of the data, the framework DistBelief was used with multiple replicas of the model. Adagrad was utilized for asynchronous gradient descent.
 \item
-Two distint models were conceptualized for the training of the word vectors based on context, both of which are continous and distributed representations of words.
+Two distinct models were conceptualized for training the word vectors based on context, both of which are continuous and distributed representations of words. These are illustrated in Figure \ref{fig:cbow-skipgram}.
+\begin{figure}[ht]
+\centering
+\includegraphics[width=.5\textwidth]{cbow-skipgram}
+\caption{CBOW and Skip-gram models}
+\label{fig:cbow-skipgram}
+\end{figure}
 \begin{itemize}
 \item
 Continuous Bag-of-Words model: This model uses the context of a word, i.e.\ the words that precede and follow it, to predict the current word.
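To make the difference between the two training setups concrete, here is a minimal sketch of how training pairs could be formed from a sentence. The window size c = 2, the whitespace tokenizer, and the helper name training_pairs are illustrative assumptions, not details taken from the paper.

# Minimal sketch of how CBOW and Skip-gram frame the prediction task.
# Assumes a toy whitespace tokenizer and a window size of c = 2.

def training_pairs(tokens, c=2):
    """Return (context, target) pairs for CBOW and (input, output) pairs for Skip-gram."""
    cbow, skipgram = [], []
    for t, target in enumerate(tokens):
        window = [tokens[j]
                  for j in range(max(0, t - c), min(len(tokens), t + c + 1))
                  if j != t]
        # CBOW: the surrounding words jointly predict the current word.
        cbow.append((window, target))
        # Skip-gram: the current word predicts each surrounding word.
        skipgram.extend((target, ctx) for ctx in window)
    return cbow, skipgram

cbow, skipgram = training_pairs("the quick brown fox jumps".split())
print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]

In Word2Vec both setups feed a log-linear model with shared embedding matrices; the sketch only shows how the prediction task is framed, not the training itself.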
@@ -323,19 +329,62 @@ \section{Efficient Estimation of Word Representations in Vector Space} % (fold)
 \section{Distributed Representations of Words and Phrases and their Compositionality} % (fold)
 \label{sec:distributed_representations_of_words_and_phrases_and_their_compositionality}
 
-\textbf{Goal:}
-This paper builds upon the idea of the Word2Vec skip-gram model, and presents optimizations in terms of quality of the word embeddings as well as speed-ups while training. It also proposes an alternative to the hierarchical softmax final layer, called negative sampling.
+\textbf{Goal:}\\
+This paper builds upon the idea of the Word2Vec skip-gram model, and presents optimizations in terms of the quality of the word embeddings as well as speed-ups during training. It also proposes an alternative to the hierarchical softmax final layer, called negative sampling.\\
 
 \textbf{Approach:}
 \begin{itemize}
 \item
-
+One of the optimizations suggested is to sub-sample the training-set words to achieve a speed-up in training.
+\item
+Given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, the objective of the Skip-gram model is to maximize the average log probability
+\begin{equation}
+\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t)
+\end{equation}
+where $c$ is the size of the context window around the current word being trained on.
+\item
+As introduced by Morin and Bengio \cite{morin2005hierarchical}, a computationally efficient approximation of the full softmax is the hierarchical softmax. The hierarchical softmax uses a binary tree representation of the output layer, with the $W$ words as its leaves; for each node, it explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.
+\item
+The authors use a binary Huffman tree, as it assigns short codes to the frequent words, which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speed-up technique for neural network based language models.
+\item
+Noise Contrastive Estimation (NCE), which is an alternative to hierarchical softmax, posits that a good model should be able to differentiate data from noise by means of logistic regression.
+\item
+To counter the imbalance between rare and frequent words, the authors use a simple subsampling approach: each word $w_i$ in the training set is discarded with the probability given by the formula below, where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold.
+$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
+This is similar to a dropout of neurons from the network, except that it is statistically more likely that frequent words are removed from the corpus by virtue of this method.
+\item
+Discarding the frequently occurring words allows for a reduction in computational and memory cost.
+\item
+The individual words can easily be coalesced into phrases using unigram and bigram frequency counts, as shown below, where $\delta$ is a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed.
+$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$
+\item
+Another interesting property of learning these distributed representations is that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic.
 \end{itemize}
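The two count-based formulas in the list above are simple enough to state directly in code. The following sketch computes the subsampling discard probability and the phrase score on a toy corpus; the corpus, the threshold value, the discount value, and the function names are assumptions chosen for illustration only.

from collections import Counter
from itertools import pairwise  # Python 3.10+
import math

# Toy corpus; real runs use corpora with billions of tokens.
corpus = "new york is a city new york city has a new mayor".split()

unigrams = Counter(corpus)
bigrams = Counter(pairwise(corpus))
total = len(corpus)

def discard_probability(word: str, t: float) -> float:
    # P(w_i) = 1 - sqrt(t / f(w_i)), with f(w_i) the relative frequency of w_i.
    # Frequent words are likely to be dropped; rare words are mostly kept.
    f = unigrams[word] / total
    return max(0.0, 1.0 - math.sqrt(t / f))

def phrase_score(w_i: str, w_j: str, delta: float = 1.0) -> float:
    # score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j)).
    # The discount delta keeps very infrequent pairs from becoming phrases.
    return (bigrams[(w_i, w_j)] - delta) / (unigrams[w_i] * unigrams[w_j])

# t = 0.05 only makes the contrast visible on this tiny corpus;
# the paper uses thresholds around 1e-5 on large corpora.
print(discard_probability("new", t=0.05))    # ~0.55: frequent, often discarded
print(discard_probability("mayor", t=0.05))  # ~0.23: rare, usually kept
print(phrase_score("new", "york"))           # ~0.17: good phrase candidate
print(phrase_score("a", "city"))             # 0.0: incidental co-occurrence

In the paper, the phrase-building pass is run more than once with a decreasing score threshold, so that longer phrases can form out of already-merged bigrams.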
 
 
 % section distributed_representations_of_words_and_phrases_and_their_compositionality (end)
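The analogical-reasoning claim in the last item of the list above can be shown with a few lines of vector arithmetic. The vectors below are toy, hand-constructed 3-d embeddings chosen so that the linear structure holds by design; learned Skip-gram vectors exhibit a similar structure empirically, and the word list and function names are illustrative assumptions.

import numpy as np

# Toy 3-d "embeddings" over the axes (royalty, maleness, femaleness),
# constructed by hand so the analogy structure holds by design.
emb = {
    "king":     np.array([1.0, 1.0, 0.0]),
    "queen":    np.array([1.0, 0.0, 1.0]),
    "prince":   np.array([0.7, 1.0, 0.0]),
    "princess": np.array([0.7, 0.0, 1.0]),
    "man":      np.array([0.0, 1.0, 0.0]),
    "woman":    np.array([0.0, 0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """'a is to b as c is to ?': nearest word to emb[b] - emb[a] + emb[c],
    excluding the three query words, ranked by cosine similarity."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))     # 'queen'
print(analogy("king", "prince", "queen"))  # 'princess'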
 
 
+
+\section{Linguistic Regularities in Continuous Space Word Representations} % (fold)
+\label{sec:linguistic_regularities_in_continuous_space_word_representations}
+
+
+\textbf{Goal:}\\
+\\
+
+\textbf{Approach:}
+\begin{itemize}
+\item
+
+\end{itemize}
+
+
+
+% section linguistic_regularities_in_continuous_space_word_representations (end)
+
+
+
 \newpage
 
 \bibliographystyle{unsrt}
