Text document clustering using Spectral Clustering algorithm with Particle Swarm Optimization (inclusion of PCA)
Team Members -
VEDANT MEHRA - 171CO250
RISHI SHARMA - 171CO135
Document clustering is the grouping of text documents into groups or clusters. The main aim is to produce clusters that are internally coherent but considerably different from each other. It is a crucial process in information retrieval, information extraction and document organization. In recent years, spectral clustering has been widely applied in machine learning as an innovative clustering technique. This work proposes a novel Spectral Clustering algorithm with Particle Swarm Optimization (SCPSO) to improve text document clustering. The initial population is randomized, and both global and local optimization functions are considered. The aim is to combine spectral clustering with swarm optimization to deal with huge volumes of text documents. The proposed SCPSO algorithm is evaluated on benchmark datasets against existing approaches, namely Spherical K-means, the Expectation Maximization (EM) method and the standard PSO algorithm. The results show that SCPSO yields better clustering accuracy than the other clustering techniques.
Install Jupyter Notebook - https://jupyter.org/install
Install Anaconda - https://docs.anaconda.com/anaconda/install/
NLTK - https://www.nltk.org/install.html
Install gensim from a Jupyter notebook cell: `!pip install --upgrade gensim`
numpy, pandas, matplotlib and scikit-learn are already included in Anaconda.
Clone this repository and open the notebook in Jupyter Notebook.
You can now run every cell of the notebook. Further details on what each section of the code contains are given below. The first section contains the preliminary work that is needed: we __*import the libraries*__ that are required, __*download stopwords*__, and define the function that creates the __*tf-idf vector*__, roughly as sketched below.
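For concreteness, here is a minimal sketch of how such a tf-idf vector can be built with NLTK stopwords and scikit-learn; the notebook's own helper may tokenise and weight terms differently.

```python
# Minimal sketch of tf-idf vectorisation with NLTK stopwords (the notebook's
# own helper may differ in tokenisation and weighting details).
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')

def build_tfidf(documents):
    """Return the fitted vectorizer and the sparse document-term tf-idf matrix."""
    vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),
                                 lowercase=True)
    tfidf_matrix = vectorizer.fit_transform(documents)   # shape: (n_docs, n_terms)
    return vectorizer, tfidf_matrix
```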
The next section __*uses the imported Reuters dataset*__, divides it into training and testing data, and forms the tf-idf vector from the training data; a sketch of this step follows.
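A possible way to load the Reuters corpus from NLTK and split it using the corpus' own training/test file ids is shown below; `build_tfidf` is the hypothetical helper from the previous sketch, and the notebook's actual split may differ.

```python
# Sketch: load the Reuters corpus shipped with NLTK and split by file id prefix.
import nltk
from nltk.corpus import reuters

nltk.download('reuters')

train_ids = [f for f in reuters.fileids() if f.startswith('training/')]
test_ids  = [f for f in reuters.fileids() if f.startswith('test/')]

train_docs = [reuters.raw(f) for f in train_ids]
test_docs  = [reuters.raw(f) for f in test_ids]

# Fit the tf-idf vectorizer on the training documents only.
vectorizer, X_train = build_tfidf(train_docs)
X_test = vectorizer.transform(test_docs)
```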
The __*Visualization*__ section imports the gensim library, tokenises the text of a single document, converts the tokenised vectors into a pandas dataframe and visualises the word embeddings. We then move to the __*Particle Swarm Optimization*__ section, which contains a function implementing the PSO algorithm; a generic sketch is given below.
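The sketch below is a generic particle swarm optimiser, included only to illustrate the velocity and position update rules; the notebook's PSO function, its fitness (for example an intra-cluster distance over the tf-idf space) and its hyperparameters are assumptions here and may differ.

```python
# Generic PSO: each particle keeps a personal best, the swarm keeps a global
# best, and velocities mix inertia, cognitive and social terms.
import numpy as np

def pso(fitness, dim, n_particles=30, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))     # particle positions
    vel = np.zeros_like(pos)                               # particle velocities
    pbest = pos.copy()                                     # personal best positions
    pbest_val = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()             # global best position

    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + cognitive (towards personal best) + social (towards global best)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()
```

For example, `pso(lambda p: np.sum(p**2), dim=5)` minimises a simple quadratic and returns the best position found along with its fitness value.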
The next section is for __*Spectral Clustering*__: it imports the necessary libraries, fits the data and calculates the Adjusted Rand Index (ARI), along the lines of the sketch below.
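Assuming `X_train` and `train_ids` from the earlier sketches, this step can look roughly like the following; the number of clusters, the affinity choice and the use of the first Reuters category as the gold label are all assumptions rather than the notebook's exact settings.

```python
# Sketch: spectral clustering of the tf-idf vectors and ARI against gold labels.
from nltk.corpus import reuters
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

# One gold label per training document, taken here as its first Reuters category.
true_labels = [reuters.categories(f)[0] for f in train_ids]

model = SpectralClustering(n_clusters=10, affinity='rbf', random_state=0)
pred_labels = model.fit_predict(X_train.toarray())

print('Adjusted Rand Index:', adjusted_rand_score(true_labels, pred_labels))
```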
The next section presents __*our own ideas*__, the first of which is to use __*Principal Component Analysis (PCA) on an affinity matrix built from Euclidean distances*__. Here we applied PCA to that affinity matrix and then calculated the ARI for the model (see the sketch below).
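One way to realise this idea, assuming the variables from the earlier sketches, is shown below; the conversion from distances to affinities, the number of PCA components and the final clustering step are assumptions, not necessarily the notebook's exact choices.

```python
# Sketch: Euclidean-distance affinity matrix -> PCA -> clustering -> ARI.
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

X = X_train.toarray()
dist = euclidean_distances(X)                  # (n_docs, n_docs) pairwise distances
affinity = 1.0 / (1.0 + dist)                  # larger value = more similar pair
reduced = PCA(n_components=50).fit_transform(affinity)

labels = SpectralClustering(n_clusters=10, random_state=0).fit_predict(reduced)
print('ARI (PCA on Euclidean affinity):', adjusted_rand_score(true_labels, labels))
```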
The next section has our other idea, which is to use __*Principal Component Analysis (PCA) on an affinity matrix built with a Gaussian kernel*__. Here we applied PCA to that affinity matrix and then calculated the ARI for the model; only the kernel construction changes, as sketched below.
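The Gaussian-kernel variant only changes how the affinity matrix is built; the `gamma` value below is an assumption, and the PCA, clustering and ARI steps mirror the previous sketch.

```python
# Sketch: Gaussian (RBF) kernel affinity matrix, then the same PCA pipeline.
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import PCA

affinity_rbf = rbf_kernel(X_train.toarray(), gamma=0.01)   # (n_docs, n_docs) similarities
reduced_rbf = PCA(n_components=50).fit_transform(affinity_rbf)
# reduced_rbf is then clustered and scored with adjusted_rand_score exactly as above.
```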
The last section is the __*Comparison of Adjusted Rand Index*__ values across the various models.
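A simple way to place the scores side by side is a bar chart; the variable names below are placeholders for the ARI values computed in the earlier cells.

```python
# Sketch: compare the ARI of the different models in a single bar chart.
import matplotlib.pyplot as plt

scores = {'Spherical K-means': ari_kmeans, 'EM': ari_em,
          'PSO': ari_pso, 'SCPSO': ari_scpso}   # placeholders for computed ARIs
plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel('Adjusted Rand Index')
plt.title('Clustering accuracy comparison')
plt.show()
```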