Text documents, such as crawled web data, are usually composed of topically coherent segments; within each segment, word usage exhibits a more consistent lexical distribution than across the data set as a whole. A linear partition of texts into topic segments can therefore support text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, recommender systems, and learning-to-rank methods.
To support such text analysis tasks, this project first extracts the text of tweets stored in XML format. Second, a set of text pre-processing steps is applied, including:
- Generate the corpus vocabulary following the same structure as sample_vocab.txt. Note that the vocabulary must be sorted alphabetically (see the first sketch after this list).
- For each day (i.e., each sheet in the Excel file), calculate the top-100 most frequent unigrams and the top-100 most frequent bigrams, following the structure of sample_100uni.txt and sample_100bi.txt. If a particular day has fewer than 100 bigrams, include only the top-n bigrams for that day (n < 100); see the second sketch after this list.
- Generate the sparse representation (i.e., the document-term matrix) of the Excel file, following the structure of sample_countVec.txt (sketched after the output file list below).
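A minimal sketch of the extraction and vocabulary steps is given below in Python. The XML schema (one `<text>` element per tweet), the tokenisation pattern, and the `token:index` line format for vocab.txt are assumptions made for illustration; the authoritative formats are defined by the source data and sample_vocab.txt.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tokeniser: lower-cased alphabetic words, optional internal hyphen/apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)?")

def extract_tweets(xml_path):
    """Return the text of every tweet in one XML file (assumes <text> elements)."""
    tree = ET.parse(xml_path)
    return [node.text for node in tree.iter("text") if node.text]

def tokenise(text):
    """Lower-case a tweet and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocab(tweets):
    """Alphabetically sorted vocabulary mapping each token to an integer id."""
    tokens = sorted({tok for tweet in tweets for tok in tokenise(tweet)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def write_vocab(vocab, path="vocab.txt"):
    """Write one 'token:index' line per vocabulary entry (assumed format)."""
    with open(path, "w", encoding="utf-8") as out:
        for tok, idx in vocab.items():  # insertion order is already alphabetical
            out.write(f"{tok}:{idx}\n")
```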
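The daily top-100 counts can be sketched as follows. The per-day grouping of tokens, the use of `collections.Counter` with `nltk.util.ngrams`, and the `day:[(gram, count), ...]` output layout are likewise assumptions; the exact layout is fixed by sample_100uni.txt and sample_100bi.txt.

```python
from collections import Counter
from nltk.util import ngrams  # nltk.util.ngrams(sequence, n) yields n-gram tuples

def top_ngrams(token_lists, n=1, k=100):
    """Return the k most frequent n-grams over one day's tokenised tweets."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens if n == 1 else ngrams(tokens, n))
    return counts.most_common(k)  # returns fewer than k entries when fewer exist

def write_top(day_to_token_lists, n, path):
    """One line per day, e.g. 'day:[(gram, count), ...]' (assumed layout)."""
    with open(path, "w", encoding="utf-8") as out:
        for day, token_lists in sorted(day_to_token_lists.items()):
            out.write(f"{day}:{top_ngrams(token_lists, n)}\n")

# Usage, assuming day_to_token_lists maps a sheet/day name to its tokenised tweets:
# write_top(day_to_token_lists, n=1, path="100uni.txt")
# write_top(day_to_token_lists, n=2, path="100bi.txt")
```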
The output of this processing is saved in the files:
- vocab.txt
- countVec.txt
- 100uni.txt
- 100bi.txt
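Finally, a possible sketch of the sparse document-term output. The per-day granularity and the `day,token_id:count,...` line format are assumptions; the authoritative layout is sample_countVec.txt.

```python
from collections import Counter

def write_count_vectors(day_to_token_lists, vocab, path="countVec.txt"):
    """One line per day: the day label followed by 'token_id:count' pairs."""
    with open(path, "w", encoding="utf-8") as out:
        for day, token_lists in sorted(day_to_token_lists.items()):
            counts = Counter(tok for tokens in token_lists for tok in tokens)
            pairs = ",".join(
                f"{vocab[tok]}:{cnt}"
                for tok, cnt in sorted(counts.items())
                if tok in vocab
            )
            out.write(f"{day},{pairs}\n")
```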