
Commit c06ebe0
Author: Vineet John
Added semeval details to the paper
1 parent a0cfea3

File tree: 2 files changed (+93, -1 lines)


project-report/cs698_project_report.bib

Lines changed: 42 additions & 0 deletions
@@ -100,3 +100,45 @@ @inproceedings{pennington2014glove
  pages={1532--1543},
  year={2014}
}

@article{sparck1972statistical,
  title={A statistical interpretation of term specificity and its application in retrieval},
  author={Sparck Jones, Karen},
  journal={Journal of Documentation},
  volume={28},
  number={1},
  pages={11--21},
  year={1972},
  publisher={MCB UP Ltd}
}

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

@inproceedings{le2014distributed,
  title={Distributed Representations of Sentences and Documents},
  author={Le, Quoc V and Mikolov, Tomas},
  booktitle={ICML},
  volume={14},
  pages={1188--1196},
  year={2014}
}

@inproceedings{SemEvalPaper,
  title={UW-FinSent at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial News Headlines},
  author={John, Vineet and Vechtomova, Olga},
  booktitle={Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  year={2017}
}

project-report/cs698_project_report.tex

Lines changed: 51 additions & 1 deletion
@@ -78,6 +78,43 @@ \section{Goal} % (fold)
% section goal (end)


\section{Document Vectorization} % (fold)
\label{sec:document_vectorization}

Document vectorization is needed to convert the text of the SemEval headlines into numeric feature vectors that can then be used to train a machine learning model \cite{SemEvalPaper}. The vectorization methods considered for this project are listed in the subsections below.

\subsection{N-gram Model} % (fold)
\label{sub:n_gram_model}
N-grams are contiguous sequences of `n' items drawn from a given sequence of text or speech. Given a complete corpus of documents, each n-gram of characters or words is assigned a unique bit in a bit vector; aggregated over a body of text, these bits form a sparse vectorized representation of the text in the form of n-gram occurrences.
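A minimal sketch of this encoding (scikit-learn is assumed here for illustration; the report does not name a vectorization library):
\begin{verbatim}
# Sketch: binary n-gram occurrence vectors.
from sklearn.feature_extraction.text import CountVectorizer

headlines = ["shares of acme surge after earnings beat",
             "acme shares slide on weak guidance"]

# binary=True records each n-gram as a presence bit rather than a count;
# ngram_range=(1, 2) emits word unigrams and bigrams.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(headlines)  # sparse matrix, one row per headline
\end{verbatim}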

% subsection n_gram_model (end)

\subsection{TF-IDF Model} % (fold)
\label{sub:tf_idf_model}
Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus \cite{sparck1972statistical}. The TF-IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word across the corpus, which adjusts for the fact that some words appear more frequently in general. It is a bag-of-words model and does not preserve word ordering.
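A standard formulation of the statistic (one common variant; the report does not fix a specific weighting scheme) is
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)},
\]
where $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$.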

% subsection tf_idf_model (end)

\subsection{Paragraph Vector Model} % (fold)
\label{sub:paragraph_vectors_doc2vec}

Paragraph Vector is an unsupervised learning algorithm that learns fixed-size vector representations for variable-length pieces of text such as sentences and documents \cite{le2014distributed}. The vector representations are learned so as to predict the surrounding words in contexts sampled from the paragraph. In the context of the SemEval headlines, a vector representation was learned for each complete headline.

Two distinct implementations were explored while attempting to vectorize the headlines using the Paragraph Vector approach.
\begin{itemize}
\item
Doc2Vec: A Python library implementation in Gensim\footnote{https://radimrehurek.com/gensim/models/doc2vec.html}.
\item
FastText: A standalone implementation in C++ \cite{bojanowski2016enriching,joulin2016bag}.
\end{itemize}
Doc2Vec was ultimately chosen, owing to its ease of integration into the existing system; a minimal usage sketch follows.
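The sketch assumes Gensim 4.x and illustrative hyperparameters (the report does not state the settings used):
\begin{verbatim}
# Sketch: learning fixed-size headline vectors with Gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

headlines = ["shares of acme surge after earnings beat",
             "acme shares slide on weak guidance"]
corpus = [TaggedDocument(words=h.split(), tags=[i])
          for i, h in enumerate(headlines)]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count,
            epochs=model.epochs)

# Infer a fixed-size feature vector for a new headline.
vector = model.infer_vector("acme beats earnings forecast".split())
\end{verbatim}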

% subsection paragraph_vectors_doc2vec (end)

% section document_vectorization (end)

\section{A Primer of Neural Net Models for NLP} % (fold)
\label{sec:a_primer_of_neural_net_models_for_nlp}

@@ -180,7 +217,20 @@ \section{A Hierarchical Neural Autoencoder for Paragraphs and Documents} % (fold)
\section{Hierarchical Probabilistic Neural Network Language Model} % (fold)
\label{sec:hierarchical_probabilistic_neural_network_language_model}

\textbf{Goal:}
Implement a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by prior knowledge extracted from the WordNet semantic hierarchy; a sketch of the decomposition follows the summary below.

\textbf{Summary:}
\begin{itemize}
\item
Similar to the previous paper, it attempts to tackle the `curse of dimensionality' described in \ref{sec:a_neural_probabilistic_language_model}.
\item
It attempts to produce a much faster variant of that model.
\item
Back-off n-grams are used for this approach too, and a real-valued vector representation of each word is learned.
\item
The word embeddings learnt are shared across all the participating nodes in the distributed architecture.
\end{itemize}
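As a sketch of the decomposition (the notation here is assumed, not taken from the report): each word $w$ is a leaf of a binary tree over the vocabulary $V$, reached from the root by a path of binary decisions $b_1(w), \dots, b_m(w)$, so that
\[
P(w \mid h) = \prod_{j=1}^{m} P\left(b_j(w) \mid b_1(w), \dots, b_{j-1}(w), h\right),
\]
where $h$ is the conditioning context. This replaces one $|V|$-way normalization with roughly $\log_2 |V|$ binary decisions per prediction, which is the source of the reported speed-up.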

% section hierarchical_probabilistic_neural_network_language_model (end)
