=============
This project is organized into the following phases, carried out in sequence:
- Web Crawling
The web forums are crawled to collect the relevant data, which is stored in flat files.
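As an illustration only, a minimal crawl step might look like the sketch below, assuming Java 11+, a hypothetical forum URL, and a hypothetical data/raw output directory; the real crawler lives in the web-crawler project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ForumCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical forum page; the real crawler walks many thread URLs.
        String url = "https://example.com/forum/thread/1";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Store the raw page as a flat file for later pre-processing.
        Files.createDirectories(Path.of("data/raw"));
        Files.writeString(Path.of("data/raw/thread-1.html"), response.body());
    }
}
```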
- Pre-Processing the data
Pre-processing organizes the noisy and malformed data into a format suitable for PoS tagging.
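The exact clean-up rules are project specific; the sketch below shows one plausible version, assuming the raw pages are HTML flat files and using simple regex tag stripping as a stand-in for a proper HTML parser (file paths are hypothetical).

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class PreProcessor {
    // Reduce a raw crawled page to plain text, one cleaned document per output file.
    static String clean(String html) {
        return html.replaceAll("(?s)<script.*?</script>", " ") // drop scripts
                   .replaceAll("(?s)<[^>]+>", " ")             // drop remaining tags
                   .replaceAll("&[a-zA-Z]+;", " ")             // drop named HTML entities
                   .replaceAll("\\s+", " ")                    // collapse whitespace
                   .trim();
    }

    public static void main(String[] args) throws Exception {
        String raw = Files.readString(Path.of("data/raw/thread-1.html"));
        Files.createDirectories(Path.of("data/clean"));
        Files.writeString(Path.of("data/clean/thread-1.txt"), clean(raw));
    }
}
```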
- PoS Tagging
Part-of-Speech tagging is applied to the processed data using standard PoS taggers such as the Stanford PoS Tagger, the OpenNLP Tagger, and LTAG-Spinal.
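As a rough example of the OpenNLP route, the sketch below assumes the pre-trained en-pos-maxent.bin model has been downloaded and sits next to the program; the sample sentence is made up.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTagExample {
    public static void main(String[] args) throws Exception {
        // Assumes the pre-trained English maxent model has been downloaded locally.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(modelIn);
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("The battery drains very quickly");
            String[] tags = tagger.tag(tokens);   // one PoS tag per token, e.g. DT NN VBZ RB RB

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
```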
- Stop-Word Removal
Stop-word removal discards words that carry little meaning, such as very common function words (e.g. "the", "is", "and").
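A minimal sketch of the idea, using a tiny hard-coded stop list purely for illustration; the real stop list would be larger and loaded from a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Tiny illustrative stop list; in practice this would be read from a resource file.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "and", "or", "of", "to", "in");

    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "battery", "is", "draining", "quickly");
        System.out.println(removeStopWords(tokens));   // [battery, draining, quickly]
    }
}
```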
- Stemming & Lemmatization
Stemming reduces related word forms to a single base form; for example, "running" and "run" both reduce to the stem "run". Lemmatisation (or lemmatization) is the process of grouping together the different inflected forms of a word so they can be analysed as a single item; for example, "running", "ran", and "run" can all be lemmatized to "run".
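The sketch below is only a toy illustration of the difference: a crude suffix-stripping "stemmer" and a dictionary-based "lemmatizer" with a handful of hard-coded entries. A real pipeline would use a proper stemmer (e.g. Porter/Snowball) and lemmatizer instead.

```java
import java.util.Map;

public class NormalizeExample {
    // Toy stemmer: strips a few common suffixes; not the real Porter algorithm.
    static String stem(String word) {
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.endsWith("ed"))  return word.substring(0, word.length() - 2);
        if (word.endsWith("s"))   return word.substring(0, word.length() - 1);
        return word;
    }

    // Toy lemmatizer: a dictionary lookup that also covers irregular forms such as "ran" -> "run".
    private static final Map<String, String> LEMMAS =
            Map.of("running", "run", "ran", "run", "runs", "run", "better", "good");

    static String lemmatize(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("running"));      // prints "runn" (crude suffix stripping)
        System.out.println(lemmatize("ran"));     // prints "run" (dictionary-based)
    }
}
```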
- Pruning
Low-frequency words are removed from the word list.
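A minimal sketch of pruning by corpus frequency; the threshold and sample tokens are chosen arbitrarily for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Pruning {
    // Drop terms whose total frequency falls below a threshold.
    static Map<String, Integer> prune(List<String> tokens, int minFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minFrequency);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("battery", "battery", "drain", "drain", "hinge");
        System.out.println(prune(tokens, 2));   // "hinge" (frequency 1) is pruned; battery and drain survive
    }
}
```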
- Weighting
Each term in a document is given a weight by calculating its "tf-idf" score, the product of term frequency and inverse document frequency:
tfidf_t = tf_t · (log2(n) − log2(df_t) + 1)
tf_t - frequency of term t in the document
df_t - number of documents in which term t appears
n - total number of documents
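A small sketch implementing the formula above; the sample term and document frequencies are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdf {
    // tfidf_t = tf_t * (log2(n) - log2(df_t) + 1), as defined above.
    static double tfIdf(int tf, int df, int n) {
        return tf * (log2(n) - log2(df) + 1);
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Weight every term of one document, given per-term document frequencies for the whole corpus.
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        termFreqs.forEach((term, tf) ->
                weights.put(term, tfIdf(tf, docFreqs.getOrDefault(term, 1), numDocs)));
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = Map.of("battery", 3, "drain", 1);  // one document
        Map<String, Integer> docFreqs  = Map.of("battery", 2, "drain", 10); // whole corpus
        System.out.println(weigh(termFreqs, docFreqs, 10));
    }
}
```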
- Cosine Similarity
The cosine similarity between two document vectors d_i and d_j is
s(d_i, d_j) = cos(θ(d_i, d_j)) = (d_i · d_j) / (|d_i| · |d_j|)
i.e. Cosine Similarity(Doc1, Doc2) = Dot product(Doc1, Doc2) / (||Doc1|| * ||Doc2||)
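A direct translation of the formula into code, assuming the documents have already been turned into tf-idf vectors over a shared vocabulary; the sample vectors are made up.

```java
public class CosineSimilarity {
    // cos(d_i, d_j) = (d_i . d_j) / (||d_i|| * ||d_j||), with documents as tf-idf vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc1 = {0.0, 2.3, 1.1, 0.0};
        double[] doc2 = {0.0, 1.9, 0.0, 0.7};
        System.out.println(cosine(doc1, doc2));   // value in [0, 1] for non-negative tf-idf vectors
    }
}
```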
- Clustering
A clustering algorithm is applied to the weighted document vectors to group similar documents into clusters.
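The algorithm is not fixed here; as one possible choice, the sketch below runs a plain k-means over the document vectors, using Euclidean distance for brevity (a cosine-based variant would fit the previous step more closely).

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
    // Assign each document vector to the nearest of k centroids, then recompute the centroids.
    static int[] cluster(double[][] docs, int k, int iterations) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = docs[rnd.nextInt(docs.length)].clone();   // random initial centroids
        }
        int[] assignment = new int[docs.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int d = 0; d < docs.length; d++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int i = 0; i < docs[d].length; i++) {
                        double diff = docs[d][i] - centroids[c][i];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[d] = best;
            }
            // Update step: move each centroid to the mean of the vectors assigned to it.
            for (int c = 0; c < k; c++) {
                double[] mean = new double[docs[0].length];
                int count = 0;
                for (int d = 0; d < docs.length; d++) {
                    if (assignment[d] == c) {
                        count++;
                        for (int i = 0; i < mean.length; i++) mean[i] += docs[d][i];
                    }
                }
                if (count > 0) {
                    for (int i = 0; i < mean.length; i++) mean[i] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] docs = { {1.0, 0.1}, {0.9, 0.0}, {0.1, 1.2}, {0.0, 0.9} };
        // Prints the cluster index assigned to each document vector.
        System.out.println(Arrays.toString(cluster(docs, 2, 10)));
    }
}
```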
To work with DataMining you also need to check out the web-crawler project, since DataMining has an internal dependency on web-crawler.
Build the web-crawler project first, then build the DataMining project.